Multijoin queries in Django - django

What's the best and/or fastest method of doing multijoin queries in Django using the ORM and QuerySet API?

If you are trying to join across tables linked by ForeignKeys or ManyToManyField relationships then you can use the double underscore syntax. For example if you have the following models:
from django.db import models

class Foo(models.Model):
    name = models.CharField(max_length=255)

class FizzBuzz(models.Model):
    bleh = models.CharField(max_length=255)

class Bar(models.Model):
    # on_delete is a required argument since Django 2.0
    foo = models.ForeignKey(Foo, on_delete=models.CASCADE)
    fizzbuzz = models.ForeignKey(FizzBuzz, on_delete=models.CASCADE)

You can do something like:

FizzBuzz.objects.filter(bar__foo__name="Adrian")
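For reference, a double-underscore lookup like the one above corresponds to two INNER JOINs in SQL. The sketch below reproduces that query with Python's built-in sqlite3 module rather than Django. The table and column names (foo, fizzbuzz, bar, foo_id, fizzbuzz_id) are assumptions chosen to mirror the models, not the names Django would generate, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fizzbuzz (id INTEGER PRIMARY KEY, bleh TEXT);
    CREATE TABLE bar (
        id INTEGER PRIMARY KEY,
        foo_id INTEGER REFERENCES foo(id),
        fizzbuzz_id INTEGER REFERENCES fizzbuzz(id)
    );
    INSERT INTO foo VALUES (1, 'Adrian'), (2, 'Jacob');
    INSERT INTO fizzbuzz VALUES (1, 'first'), (2, 'second');
    INSERT INTO bar VALUES (1, 1, 1), (2, 2, 2);
""")

# Roughly what a filter on bar__foo__name="Adrian" asks the database for:
# start from fizzbuzz, hop through bar, then through foo.
rows = conn.execute(
    """
    SELECT fizzbuzz.id, fizzbuzz.bleh
    FROM fizzbuzz
    INNER JOIN bar ON bar.fizzbuzz_id = fizzbuzz.id
    INNER JOIN foo ON bar.foo_id = foo.id
    WHERE foo.name = ?
    """,
    ("Adrian",),
).fetchall()
```

In a Django project you can inspect the real statement with str(queryset.query) instead of guessing.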

Don't use the API ;-) Seriously, if your JOINs are complex, you should see significant performance increases by dropping down into SQL rather than using the API. And this doesn't mean you need to get dirty, dirty SQL all over your beautiful Python code; just make a custom manager to handle the JOINs and then have the rest of your code use it rather than direct SQL.
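A minimal sketch of that encapsulation idea, written against sqlite3 so it runs without Django. The class name FizzBuzzQueries, its method, and the table layout are all hypothetical; the point is only that the JOIN lives in one place and callers get a plain method.

```python
import sqlite3


class FizzBuzzQueries:
    """Hypothetical manager-like wrapper: the only place that holds JOIN SQL."""

    _BY_FOO_NAME = """
        SELECT fizzbuzz.id, fizzbuzz.bleh
        FROM fizzbuzz
        JOIN bar ON bar.fizzbuzz_id = fizzbuzz.id
        JOIN foo ON bar.foo_id = foo.id
        WHERE foo.name = ?
    """

    def __init__(self, conn):
        self.conn = conn

    def for_foo_name(self, name):
        # The value is bound by the driver, never formatted into the SQL text
        return self.conn.execute(self._BY_FOO_NAME, (name,)).fetchall()


# Minimal in-memory schema (assumed names) so the sketch runs on its own
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fizzbuzz (id INTEGER PRIMARY KEY, bleh TEXT);
    CREATE TABLE bar (id INTEGER PRIMARY KEY, foo_id INTEGER, fizzbuzz_id INTEGER);
    INSERT INTO foo VALUES (1, 'Adrian');
    INSERT INTO fizzbuzz VALUES (10, 'hello');
    INSERT INTO bar VALUES (100, 1, 10);
""")

# Calling code never sees the SQL
result = FizzBuzzQueries(conn).for_foo_name("Adrian")
```

In a Django project the same method would sit on a models.Manager subclass attached to the model, so the rest of the code keeps calling something that looks like the normal API.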
Also, I was just at DjangoCon where they had a seminar on high-performance Django, and one of the key things I took away from it was that if performance is a real concern (and you plan to have significant traffic someday), you really shouldn't be doing JOINs in the first place, because they make scaling your app while maintaining decent performance virtually impossible.
Here's a video Google made of the talk:
http://www.youtube.com/watch?v=D-4UN4MkSyI&feature=PlayList&p=D415FAF806EC47A1&index=20
Of course, if you know that your application is never going to have to deal with that kind of scaling concern, JOIN away :-) And if you're also not worried about the performance hit of using the API, then you really don't need to worry about the (AFAIK) minuscule, if any, performance difference between using one API method over another.
Just use:
http://docs.djangoproject.com/en/dev/topics/db/queries/#lookups-that-span-relationships
Hope that helps (and if it doesn't, hopefully some true Django hacker can jump in and explain why method X actually does have some noticeable performance difference).

Use the queryset.query.join method, but only if the other method described here (using double underscores) isn't adequate.

Caktus blog has an answer to this: http://www.caktusgroup.com/blog/2009/09/28/custom-joins-with-djangos-queryjoin/
Basically there is a hidden QuerySet.query.join method that allows adding custom joins.

Related

jOOQ CommonTableExpression with SelectQuery

We have a query with a common structure that we use under different circumstances (where clause is different)
So we have something like this
val baseQuery: SelectQuery<Record> = dsl
    .select(someFields)
    .from(someTable)
    .join(otherTable).on(joinClause)
    .query
In other places we then extend this query however needed.
baseQuery.apply {
    addConditions(conditionsA)
}.fetch()
baseQuery.apply {
    addConditions(conditionsB)
}.fetch()
So far so good. But now it would be cool if we could somehow use this base in combination with a CTE. Not sure how to do that though.
val someCTE: CommonTableExpression<*> = DSL.....
// dsl
//     .with(someCTE)
//     .selectQuery(baseQuery.apply {}) ¯\_(ツ)_/¯
baseQuery.apply {
    // addWith(someCTE) ¯\_(ツ)_/¯
    addSelect(someCTE.field(cteField))
    addJoin(someCTE, joinClause)
    addConditions(conditionsC)
}
Is there a way to do it? Perhaps other suggestions how to reuse the base query when using a CTE?
Edit: Solution
With the help of Lukas' answer I settled on this approach
fun dynamicQuery(
    context: SelectSelectStep<*> = dsl.select(),
    selects: List<Field<*>> = listOf(),
    joins: List<Pair<Table<*>, Condition>> = listOf(),
    conditions: List<Condition> = listOf()
): SelectQuery<Record>
So in normal customizations I can
dynamicQuery(
    conditions = conditionsA
).fetch()
dynamicQuery(
    conditions = conditionsB
).fetch()
It can be combined with a CTE
val someCTE: CommonTableExpression<*> = DSL.....
dynamicQuery(
    context = dsl.with(someCTE).select(),
    selects = listOf(someCTE.field(cteField)),
    joins = listOf(someCTE to joinClause),
    conditions = conditionsC
).fetch()
TL;DR: You can't use CTEs with jOOQ 3.15's model API
Some background on the jOOQ model API vs DSL API distinction
The very old jOOQ 1.0 only had what is now called the "model API", a mutable, procedural API with setters (no getters), where you can manipulate dynamic SQL.
jOOQ 2.0 introduced the DSL API, which is what most people are using today. The fact that the DSL API mimics the SQL language helps users discover the jOOQ API much more easily. Everything is named exactly as expected. With the exception of a few quirks in the areas of CTE and derived tables, you can write jOOQ-SQL almost just like actual SQL.
The model API was not deprecated, but wrapped by the DSL API, and kept around:
for backwards compatibility reasons
because some people seemed to like the procedural approach
You can't do anything with the model API that you couldn't do with the DSL API as well, though a more functional programming style may be helpful when doing this with the DSL API. See: https://blog.jooq.org/2017/01/16/a-functional-programming-approach-to-dynamic-sql-with-jooq
The future of jOOQ
While the model API is still getting some new clauses support for SELECT, INSERT, UPDATE, DELETE statements, there are some statements that are not available in a model API form. These include MERGE, TRUNCATE, all DDL statements, all procedural statements. And, the WITH clause.
The strategy is to eventually deprecate the model API, because the redundancy creates a lot of extra work that is better invested elsewhere. There are also subtle bugs when people call model API methods in unexpected order, i.e. an order that is not possible through the DSL API.
In a first step, pretty soon, jOOQ will invert the relationship between the APIs: https://github.com/jOOQ/jOOQ/issues/11241. The model API will be the auxiliary wrapper of the DSL API for those who rely on it for backwards compatibility. It isn't unlikely that the model API will even be extracted into a separate compatibility module, to discourage its use in new code.
In a next step, with the dependencies inverted, the DSL API can finally become consistently immutable, which is what many users expect and, to their surprise, find lacking: https://github.com/jOOQ/jOOQ/issues/9047
Eventually, the model API will be deprecated, and then dropped.
You can still use it today, and the deprecation and removal will be done over a long period of time, so there's no hurry in getting off this API (as always with jOOQ). But in the context of your question, it's good to see that jOOQ will not invest in adding too many features to it, anymore. CTE support won't be added to the model API.
Workarounds
You can, of course, work around this limitation, because internally, the model API is CTE capable:
You could use reflection to add new CTEs to the SelectQuery internal representation. I won't document how this works, here, because it's never a good idea to document these things :)
You could start creating a query using the DSL API, and then extract the internal SelectQuery representation using SelectFinalStep.getQuery(), and continue working from there.

What is the difference between annotations and regular lookups using Django's JSONField?

You can query Django's JSONField either by direct lookup or by using annotations. Now I realize that if you annotate a field, you can do all sorts of complex queries, but for the very basic query, which one is actually the preferred method?
Example: Let's say I have a model like so:
class Document(models.Model):
    data = JSONField()
And then I store an object using the following command:
>>> Document.objects.create(data={'name': 'Foo', 'age': 24})
Now, the query I want is the most basic: Find all documents where data__name is 'Foo'. I can do this 2 ways, one using annotation, and one without, like so:
>>> from django.db.models.expressions import RawSQL
>>> Document.objects.filter(data__name='Foo')
>>> Document.objects.annotate(name = RawSQL("(data->>'name')::text", [])).filter(name='Foo')
So what exactly is the difference? And if I can make basic queries, why do I need to annotate? Provided of course I am not going to make complex queries.
There is no reason whatsoever to use raw SQL for queries where you can use ORM syntax. For someone who is conversant in SQL but less experienced with Django's ORM, RawSQL might provide an easier path to a certain result than the ORM, which has its own learning curve.
There might be more complex queries where the ORM runs into problems or where it might not give you the exact SQL query that you need. It is in these cases that RawSQL comes in handy – although the ORM is getting more feature-complete with every iteration, with
Cast (since 1.10),
Window functions (since 2.0),
a constantly growing array of wrappers for database functions
the ability to define custom wrappers for database functions with Func expressions (since 1.8) etc.
They are interchangeable, so it's a matter of taste. I think Document.objects.filter(data__name='Foo') is better because:
It's easier to read.
In the future, MariaDB or MySQL may support JSON fields, and your code will then be able to run on both PostgreSQL and MariaDB.
Don't use RawSQL as a general rule. You can create security holes in your app.
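The security hole in question is SQL injection, which appears when user input is formatted into the SQL string instead of being passed as a parameter. The sketch below shows the difference with the built-in sqlite3 module; the document table and the hostile input are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE document (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO document (name) VALUES (?)", [("Foo",), ("Bar",)]
)

user_input = "nothing' OR '1'='1"  # hostile value supplied by a user

# Unsafe: the input becomes part of the SQL text and rewrites the WHERE clause
unsafe_sql = "SELECT name FROM document WHERE name = '%s'" % user_input
unsafe_rows = conn.execute(unsafe_sql).fetchall()  # matches every row

# Safe: the input is a bound parameter and is compared as a plain string
safe_rows = conn.execute(
    "SELECT name FROM document WHERE name = ?", (user_input,)
).fetchall()  # matches nothing
```

RawSQL takes a params list as its second argument (the [] in the question's example), so if you do need it, pass user-supplied values there rather than building them into the SQL string.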

Django ORM Which technique is better?

I have a project.
project = Project.objects.get(id=1)
and now I want to select the data from the related tables of the project. It can be done in 2 ways; let me know which one is better, and why.
attachments = project.attachments_set.all()
samples = project.projectsamples_set.all()
OR
attachments = Attachments.objects.filter(project=ctx['project'])
samples = ProjectSamples.objects.filter(project=ctx['project'])
I would like to know the technical perspective.
These queries are exactly equivalent, as you can see if you examine the generated SQL. I would say that the first is preferable as it is more compact and readable, but that is very much subjective so it is up to you which you use.
(Note that if you don't actually have the project object to start with, and don't need it, then it's more efficient to query Attachments and ProjectSamples via project_id than to get the project and use the related accessors. However, that doesn't appear to be the case in your example.)

Django: Filtering a queryset locally from cache

If I perform a prefetch_related('toppings') for a queryset, and I want to later filter(spicy=True) by fields in the related table, Django ignores the cached info and does a database query. I found that this is documented (under the Note box) and seems to happen for all forms of caching (select_related(), already evaluated querysets, etc.) when another filter() is performed.
However, is there some sort of super secret hidden time-saving shortcut to filter locally (using the cache and not hitting the database) without having to write the python code to loop the queryset (using list/dict comprehension, etc.)? Maybe something like a filter_locally(spicy=True)?
EDIT:
One of the reasons why a list/comprehension doesn't work well for me is because a list/dict does not have the queryset methods. In my case, the first level M2M field, toppings, isn't the end goal for me and I need to check a 2nd related M2M field (which I have already pre-fetched as well). While this is also possible using list comprehension, it's just much simpler to have something such as filter_locally(spicy=True, origin__country='Spain') because:
it allows accessing many levels of related fields with minimal effort
it allows chaining other queryset methods
it's easier to read because it's consistent with the familiar filter()
it's easier to modify existing code that uses filter() without prefetch to add this optimization without many changes.
But from the responses, Django has no such support :(
You have to write the python code to loop through the queryset (a list/dict comprehension is ideal). All the filter() code knows how to do is add filtering language to the SQL sent to the database. Filtering locally is a totally different problem than filtering remotely, so the solutions to those two separate problems won't be able to share any logic.
A list comprehension one-liner would be pretty straightforward, though; the syntax might not be much more complex than with filter().
If you're filtering on a boolean doing the list comprehension is pretty easy. You can also swap out the topping.spicy==True for a string comparison or whatever.
I would do something like:
qs = Pizza.objects.all().prefetch_related('toppings')

def get_spicy(qs):
    # Evaluate the queryset once; toppings.all() then reads the prefetch cache
    res = list(qs)
    return [pizza for pizza in res
            if any(topping.spicy == True for topping in pizza.toppings.all())]
That is if you want to return the pizza object if any of its toppings is spicy. You can also replace the any() with all() to check for all, and do a lot of pretty powerful queries with this syntax. I'm somewhat surprised that there is no easy way to do this in django. It seems like a lot of these simple queries should be easy to implement in a generic manner.
The above code assumes a many2many. It should be easy to modify to work with a simple FK relationship such as a one2one or one2many.
Hope this was helpful.
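For what it's worth, the filter_locally(...) idea from the question can be approximated in a few lines of plain Python. This is only a sketch: filter_locally and _values are hypothetical helpers, not part of Django, they support exact-match lookups only, and they return a list rather than a queryset, so nothing can be chained onto the result. The demo objects are SimpleNamespace stand-ins for model instances whose relations are already loaded.

```python
from types import SimpleNamespace


def _values(obj, parts):
    """Yield every value reached from obj by following the attribute names in parts."""
    if not parts:
        yield obj
        return
    value = getattr(obj, parts[0])
    # Related managers expose .all(); after prefetch_related it reads the cache
    if callable(getattr(value, "all", None)):
        value = value.all()
    # Treat a single value as a one-element collection
    if isinstance(value, (str, bytes)) or not hasattr(value, "__iter__"):
        value = [value]
    for item in value:
        yield from _values(item, parts[1:])


def filter_locally(objects, **conditions):
    """Keep objects where every lookup path reaches at least one equal value."""
    def matches(obj):
        return all(
            any(v == expected for v in _values(obj, lookup.split("__")))
            for lookup, expected in conditions.items()
        )
    return [obj for obj in objects if matches(obj)]


# Stand-in data: plain lists play the role of prefetched relations
spain = SimpleNamespace(country="Spain")
italy = SimpleNamespace(country="Italy")
chorizo = SimpleNamespace(spicy=True, origin=spain)
basil = SimpleNamespace(spicy=False, origin=italy)
diavola = SimpleNamespace(name="diavola", toppings=[chorizo, basil])
margherita = SimpleNamespace(name="margherita", toppings=[basil])
pizzas = [diavola, margherita]

spicy_spanish = filter_locally(
    pizzas, toppings__spicy=True, toppings__origin__country="Spain"
)
```

Note that each keyword is checked independently, so two conditions on toppings may be satisfied by two different toppings. That is closer to chaining two filter() calls than to putting both conditions in one.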

Secure-by-default django ORM layer---how?

I'm running a Django shop where we serve each of our clients an object graph which is completely separate from the graphs of all the other clients. The data is moderately sensitive, so I don't want any of it to leak from one client to another, nor for one client to delete or alter another client's data.
I would like to structure my code such that I by default write code which adheres to the security requirements (No hard guarantees necessary), but lets me override them when I know I need to.
My main fear is that in a Twig.objects.get(...), I forget to add client=request.client, and likewise for Leaf.objects.get where I have to check that twig__client=request.client. This quickly becomes error-prone and complicated.
What are some good ways to get around my own forgetfulness? How do I make this a thing I don't have to think about?
One candidate solution I have in mind is this:
Set the default object manager as DANGER = models.Manager() on my abstract base class(es).
Have a method ok(request) on said base classes which applies .filter(leaf__twig__branch__trunk__root__client=request.client) as applicable.
use MyModel.ok(request) instead of MyModel.objects wherever feasible.
Can this be improved upon? One not so nice issue is when a view calls a model method, e.g. branch.get_twigs_with_fruit, I now have to either pass a request for it to run through ok or I have to invoke DANGER. I like neither :-\
Is there some way of getting access to the current request? I think that might mitigate the situation...
I'll explain a different problem I had; however, I think the solution might be something to look into.
Once I was working on a project to visualize data where I needed a really big table which would store all the data for all visualizations. That turned out to be a big problem because I would have to do things like Model.objects.filter(visualization=5), which was neither elegant nor efficient.
To make things simpler and more efficient I ended up creating dynamic models on the fly. Essentially I would create a separate table in the db on the fly and then store the data for only that one visualization in it. My code is something like:
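from django.db import models
from django.db.models.base import ModelBase

def get_model_class(table_name):
    # Renamed so the subclass does not shadow the imported ModelBase
    class DynamicModelBase(ModelBase):
        def __new__(cls, name, bases, attrs, **kwargs):
            # Give each generated model class a unique name per table
            name = '{}_{}'.format(name, table_name)
            return super().__new__(cls, name, bases, attrs, **kwargs)

    # Python 3 metaclass syntax (the __metaclass__ attribute is Python 2 only)
    class Data(models.Model, metaclass=DynamicModelBase):
        # fields here

        class Meta:
            db_table = table_name

    return Data

dynamic_model = get_model_class('foo')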
This was useful for my purposes because it allowed queries to be much faster but getting back to your issue I think something like this can be useful because this will make sure that each client's data is separate not only via a foreign key, but is actually separated in the db.
Using this method is pretty straightforward, except that before using the model you have to call the function to get it for each client. To make things more efficient you can cache/memoize the results of the function call so that it does not have to recompute the same thing more than once.
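One way to do the memoization is functools.lru_cache from the standard library. The sketch below uses a plain class built with type() as a stand-in for the Django model, so it runs without Django; only the caching behaviour is the point, and the table names are made up.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_model_class(table_name):
    # Stand-in for the model factory: builds one class per table name.
    # With the cache, the body runs only once for each distinct table_name.
    return type(
        "Data_{}".format(table_name),
        (object,),
        {"db_table": table_name},
    )


first = get_model_class("client_1")
again = get_model_class("client_1")   # served from the cache
other = get_model_class("client_2")   # different argument, new class
```

Because the cache is keyed on the argument, repeated calls for the same client return the very same class object rather than defining it again.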