Django JOIN: raw SQL vs. in memory

I have an SQL JOIN that does not translate well into Django ORM syntax, and I am wondering which alternative solution is better:
1. Use a raw SQL query, or
2. Use two querysets and perform the join in memory to annotate one of them.
I can join the two querysets in linear time, so I don't think option 2 would be significantly slower. But which approach is better conceptually?

The primary considerations are performance, readability/clarity of the code, and maintainability.
If performance is not an issue here, as you say, then clarity and maintainability are the primary factors. Pure Python/Django ORM code is typically easier to read and follow than SQL queries inlined as strings in Python code. In my experience, SQL strings are also harder to maintain: they will not produce any error when your model changes and the query becomes invalid; the breakage only shows up at runtime.
Ultimately, readability and maintainability depend on you and any other maintainers, since the code needs to be clear and maintainable to whoever works on it. In my opinion, the SQL string approach is less clear and more brittle.
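For concreteness, here is a minimal sketch of option 2, assuming two hypothetical models, Author and BookStats, that are not from the original question: fetch both querysets, build a lookup dict keyed on the join column, and attach the computed value to each object in a single pass.

from collections import defaultdict
from myapp.models import Author, BookStats  # hypothetical models

def authors_with_total_sales():
    authors = list(Author.objects.all())
    stats = BookStats.objects.values("author_id", "copies_sold")

    # One pass over the stats rows to build the lookup table.
    totals = defaultdict(int)
    for row in stats:
        totals[row["author_id"]] += row["copies_sold"]

    # One pass over the authors to annotate them -- linear time overall.
    for author in authors:
        author.total_sales = totals.get(author.id, 0)
    return authors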

Related

Efficiency of using JSONFields vs. a pure SQL approach in Django/Postgres

I am wondering what the difference in efficiency is between using JSONFields and a pure SQL approach (regular columns) in my Postgres DB.
I now know that I can query JSONFields like this:
MyModel.objects.filter(json__title="My title")
With a regular column (what I mean by the pure SQL approach), it would look like this:
MyModel.objects.filter(title="My title")
Are these equal in efficiency?
Having separate columns for each thing is definitely more efficient.
The advantage of a JSONField is flexibility. You can store anything you want in there, and you don't have to change your database schema. But this comes at a cost. If a column is, for example, a CharField with at most 255 characters, then a huge amount of effort has gone into making the database optimise for that particular type (and likewise for other types). With a JSONField, however, the content can be literally anything, and it becomes very difficult to optimise a query for it at the actual database level.
Unless you have a good reason to use a JSON field (namely, you need that level of flexibility), it is much, much better to go with separate columns for each of your fields. There are advantages besides performance as well: you can define defaults, and you know for certain what type each field holds, which makes programming against them far easier and avoids a load of errors.
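To make the comparison concrete, here is a small sketch (the model and field names are mine, not from the question) showing both layouts side by side. A typed column can carry an ordinary index, while a JSONField on Postgres usually needs a GIN index, which helps containment lookups but not every kind of key query.

from django.db import models
from django.contrib.postgres.indexes import GinIndex

class MyModel(models.Model):
    # Separate, typed column: easy for the database to index and optimise.
    title = models.CharField(max_length=255, db_index=True, default="")

    # Flexible JSON blob: anything goes, but harder for the planner to optimise.
    json = models.JSONField(default=dict)

    class Meta:
        indexes = [GinIndex(fields=["json"])]  # speeds up containment queries on Postgres

# Column lookup -- can use the ordinary b-tree index:
MyModel.objects.filter(title="My title")

# JSONField key lookup -- extracts the key from the JSON document at query time:
MyModel.objects.filter(json__title="My title")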

When should SQLite not be used for testing in Django if a different RDBMS (e.g. PostgreSQL) is used in production and development?

This article advises to use SQLite for tests even if you use another RDBMS (I use PostgreSQL) in development and production. I tried SQLite for one test case and it ran faster indeed (~18.8 times faster, 0.5s vs 9.4s!).
In which cases could using SQLite result in different test results than if I used PostgreSQL?
Or does this only matter if I test a piece of code that contains a raw SQL query?
Any time the query generator might produce queries that behave differently on different platforms, despite its platform abstractions. Differences in regular expressions, collations and sorting, different levels of strictness about aggregates and grouping, use of full-text search or other extension features, use of anything but the most utterly simple functions and operators, etc.
Also, as you noted, any time you run raw SQL.
It's moderately reasonable to run tests on SQLite during iterative development, but you really need to run them on the same DB you're going to deploy on before you push to production. Otherwise you'll get bitten by some query where the engines differ in what they can prove about joins and GROUP BY, or are simply more or less permissive, so a query will work on one and then fail on the other.
You should also test against PostgreSQL on a reasonable data set before pushing changes live in order to find obvious performance regressions that'll be an issue in production. It makes little sense to do this on SQLite, where often totally different queries will be fast or slow.
I'm surprised you're seeing the kind of speed difference you report. I'd want to look into why the tests run so much slower on PostgreSQL and what you can do about it, since in production it's clearly not going to have the same kind of performance difference. I wrote a bit about this in optimise PostgreSQL for fast testing.
The performance characteristics will be very different in most cases. Often faster. SQLite is typically good for testing because its engine does not need to account for multiple clients accessing the database at the same time; effectively only one writer is allowed at once. This greatly reduces the overhead and complexity compared to other RDBMSs.
As far as raw queries go, there are a lot of features that SQLite does not support compared to Postgres or another RDBMS. Stay away from raw queries as much as possible to keep your code portable. The exception is when you need to optimise specific queries for production. In those cases you can check in settings.py which database you are running against and fall back to the generic ORM filters instead of the raw query. Many generic raw queries will not need this sort of check, though.
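As a rough sketch of that idea (the model, table name, and query below are made up for illustration), you can inspect which database engine is configured and only run the hand-tuned raw query on Postgres:

from datetime import timedelta
from django.conf import settings
from django.utils import timezone
from myapp.models import Order  # hypothetical model

def recent_orders():
    engine = settings.DATABASES["default"]["ENGINE"]
    if "postgresql" in engine:
        # Hand-optimised query that relies on Postgres-specific syntax.
        return Order.objects.raw(
            "SELECT * FROM myapp_order "
            "WHERE created_at > now() - interval '7 days'"
        )
    # Portable ORM filter that also works on SQLite during tests.
    return Order.objects.filter(created_at__gt=timezone.now() - timedelta(days=7))
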
Also, because a SQLite DB is just a file, it is very simple to tear down and start over for testing.

When should I use C++ instead of SQL?

I am a C++ programmer who occasionally uses MySQL to work with databases, but my SQL knowledge is rather limited. However, I am certainly willing to change that.
At the moment I am trying to do analysis(!) on the data I have in my database solely with SQL queries. But I am about to give up, and instead import the data into C++ and do the analysis with C++ code.
I have discussed this with my colleagues, and they also push me to use C++, saying that SQL is not meant for complex analysis but mainly for importing (from the existing tables) and exporting (to new tables) data, plus a little more such as merging data into, e.g., joined tables.
Can somebody help me draw the line, so I know when to switch to C++? Of course, performance is also an issue.
What are indications that things are getting too complex for SQL? Or maybe I am just taking the wrong approach to designing the queries; in that case, where can I find tutorials, books, etc. to learn a better approach?
I hope this is not too vague. I am really a bit lost.
SQL excels at analyzing large sets of relational data.
The place to draw the line is the scale of your analysis.
If you analyze individual records one at a time, do it in your application.
If you analyze large sets of records as a unit, SQL is definitely the best tool for that job.
Row-by-row analysis is not something SQL is designed or optimized for very well. But, if you want to know something about a million-row group of data, do it in the database.
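To illustrate the difference, here is a small sketch using Python's sqlite3 module (chosen just for brevity; the same point applies to MySQL from C++): ask the database one set-based question rather than pulling every row into the application and looping over it.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("north", 80.0), ("south", 200.0)],
)

# Set-based: one query, and the database does all the aggregation.
summary = conn.execute(
    "SELECT region, COUNT(*), SUM(amount), AVG(amount) FROM sales GROUP BY region"
).fetchall()

# Row-by-row: pull everything out and aggregate in application code instead.
totals = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    totals[region] = totals.get(region, 0.0) + amount
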
I have discussed this with my colleagues, and they also push me to use C++, saying that SQL is not meant for complex analysis but mainly for importing (from the existing tables) and exporting (to new tables) data, plus a little more such as merging data into, e.g., joined tables.
This is completely arbitrary. Learn SQL. There are a lot of resources available on the web for free.
You can do very complex analysis of data in SQL, provided you know how to use the features that SQL offers.
SQL has features for doing relational operations, like joins and projections; for doing set operations like union, intersection, and restriction (subset); for basic arithmetic on numbers, with the four arithmetic operators and built-in functions like SQRT; and for statistical functions like COUNT, SUM, and AVG that can be combined with projections in very interesting ways. A good DBMS will let you extend the built-in functions with your own functions written in C, C++, or perhaps PL/SQL.
The power you get from these features depends on how well designed the database is. A well designed database conforms to the relational model and is relevant to your intended use of the data.
SQL code can be stored in the database as stored procedures, kept in SQL script files, and, as you already know, embedded in application programs. In addition to SQL, you can use OLAP tools and report generators to do standard things with the data very easily.
The people who advise you to keep all of your processing in C++ sound like they have learned just enough to use a database like a big and stupid file system. A good DBMS is much more than that.
SQL is usually very efficient at handling its own database (depending on the server implementation).
You should use queries to analyze the database.
The main reason for that would be the communication overhead.
Even if the server is on the local machine (remote servers have an obvious communication overhead), you'll still have to retrieve the stored information from the SQL server into your C++ program for analysis.
If you have tens of thousands of rows, you would have to get the SQL server to read them all and send them to your program, which would then probably create a local copy of the data to work on.
If you let the SQL server do the work with queries, you gain the complex optimisations it applies according to the kind of query you're executing, and in the end you retrieve only a limited amount of data (only what you actually need) over the connection.
You made the right decision to begin the data analysis with SQL. Now, when you feel that your knowledge of SQL is limiting you, you have two choices: give up and switch back to a familiar but less suitable toolset (C++), or bring your SQL skills up.
It's possible that at some point SQL will become too complex as well, but then C++ won't be the answer either; most likely some specialised tool will be.
In my opinion, you should only perform analysis in C++ if the database server provides no equivalent analysis function. Database servers are very smart, and it is hard, almost impossible, to beat the efficiency of the server's built-in analysis functions. Bringing the raw data into the application to perform the analysis also adds a lot of overhead.
If at some point plain SQL becomes overly complex, the server's native procedural language could be a good choice.
I agree with JNK and Jochai, but disagree with Ascanio.
It's better to improve your knowledge of database systems; SQL comes with it.
So, this is something I've been thinking about, and it seems to me that SQL, as just a platform/language for storing and manipulating data, should have no inherent advantage over a C++ or C library. It seems to me that theoretically you could build a C++ library just as efficient, if not more efficient, than SQL at doing this. In doing so, you would be able to build it from the ground up, in terms of how ints, chars, strings, and other data types are stored, and make it easier to interface with your particular application (like web development). You could even make it so that queries could be written in a language like JavaScript (allowing web developers to focus on learning just one language really well).

Will the use of the built-in ORM in CF 9 increase db performance?

How will it or how will it not?
Appreciate it.
That's like asking if programming language A is faster than programming language B. The fact of the matter is that you can write poor code with either of them, and you can write good code with either of them.
As Stephen says, ORM is about improving development productivity: you don't have to pay the productivity cost of context switching between application code and SQL, and in some cases it offers application-level performance boosts.
However, if you're looking to "increase db performance" then ORM is not a silver bullet. I don't think that one (a silver bullet) exists.
Nothing can beat well written code (be it ORM or SQL) that has been analyzed and optimized.
Well no not really...
ORM is not about increasing the performance of your database. It's about how you manipulate that data on the application side.
It does have elements such as object caching built in, which do help with the performance of your application, but you still need a well structured and properly indexed database schema.

Even lighter than SQLite

I've been looking for a C++ SQL library implementation that is as simple to hook in as SQLite, but faster and smaller. My projects are in games development, and there's definitely a cutoff point between needing to pass the ACID test and wanting some extreme performance. I'm willing to move away from SQL-string-style queries and let it be code-driven, but I haven't found anything out there that provides SQL-like flexibility while also preferring performance over ACID guarantees.
I don't want to go re-inventing the wheel, and the idea of implementing an SQL library on my own is quite daunting, even if it's only going to be a simple subset of all the calls you could make.
I need the basic commands (SELECT, MODIFY, DELETE, INSERT, with JOIN and WHERE), not data operations (like sorting, min, max, count), and I don't need the database to be atomic or even to enforce consistency (I can use a real SQL service while testing and debugging).
Are you sure that you have obtained the maximum speed available from SQLite? Out of the box, SQLite is extremely safe but quite slow. If you know what you are doing, and are willing to risk db corruption on a disk crash, then there are several optimizations you can do that provide spectacular speed improvements.
In particular:
Switch off synchronization
Group writes into transactions
Index tables
Use database in memory
If you have not explored all of these, then you are likely running many times slower than you might.
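As a rough illustration of those settings (shown here with Python's sqlite3 module for brevity; the same PRAGMAs are available from the C/C++ API), with the usual caveat that switching synchronisation off trades durability for speed:

import sqlite3

# Use an in-memory database (pass a file path for an on-disk one).
conn = sqlite3.connect(":memory:")

# Switch off synchronisation -- much faster, but risks corruption on a crash.
conn.execute("PRAGMA synchronous = OFF")
conn.execute("PRAGMA journal_mode = MEMORY")

conn.execute("CREATE TABLE score (player TEXT, points INTEGER)")

# Index the columns you actually query on.
conn.execute("CREATE INDEX idx_score_player ON score (player)")

# Group many writes into one transaction instead of committing per statement.
with conn:  # commits once when the block exits
    conn.executemany(
        "INSERT INTO score VALUES (?, ?)",
        [("alice", i) for i in range(10000)],
    )
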
I'm not sure you'll manage to find anything with better performance than SQLite, especially if you want operations like JOINs... Is SQLite speed really a problem? For simple requests it's usually faster than any full DBMS.
Don't you have an indexing problem?
As for size, it's not even 1 MB extra in the binary, so I'm a bit surprised that's a problem.
You can look at Berkeley DB, which is probably the fastest DB available, but it's mostly a key->value database.
If you really need higher speed consider loading the whole database in memory (using SQLite again).
Take a look at gigabase and its twin fastdb.
You might want to consider Embedded InnoDB. It offers the basic SQL functionality (obviously; see MySQL) but not the actual SQL syntax (as that's part of MySQL, not InnoDB). At 838 KB, it's not too heavy.
If you just need those basic operations, you don't really need SQL. Take a look at NoSQL data storage, for example Tokyo Cabinet.
You can try LevelDB; it's a key/value store:
http://code.google.com/p/leveldb