Customizing the generated SELECT in a query? [Granite ORM] - crystal-lang

I'm trying to build a JOIN query in Amber (using the Granite ORM) against a legacy database (with existing data and table structure), and I'm wondering if it's possible to customize the SELECT/FROM portion of the query to support a cross-table JOIN.
Here's the current table structure for a table called vehicles:
+-----------+------+--------+---------+
| vehicleid | year | makeid | modelid |
+-----------+------+--------+---------+
| 1         | 1999 | 54     | 65      |
| 2         | 2000 | 55     | 72      |
| ...       | ...  | ...    | ...     |
+-----------+------+--------+---------+
Here makeid and modelid are foreign key references to the makes and models tables, which contain the naming columns (makename and modelname, respectively).
I'm trying to generate a JOIN query to pull in the names:
SELECT vehicle.yearid, make.makename AS make, model.modelname AS model FROM vehicles JOIN....
(snipping out the JOIN details).
So that when the query returns I have a single Vehicle object and can access vehicle.yearid, vehicle.make, and vehicle.model.
Is this possible using Granite?
I can get the JOIN portion of the query to generate by using raw SQL, but I can't figure out how to customize the table and column names in the SELECT portion. I've tried creating a model like so:
class Vehicle < Granite::ORM::Base
  adapter pg

  primary vehicleid : Int32
  field yearid : Int32
  field make : String
  field model : String
end
But Granite is generating the following SQL:
SELECT vehicle.yearid, vehicle.make, vehicle.model FROM vehicle JOIN...
That's throwing an error because vehicle.make and vehicle.model don't actually exist.
What I want is this SQL:
SELECT vehicle.yearid, make.makename AS make, model.modelname AS model FROM vehicles JOIN....
Is there a way to make this work?

According to this issue, Granite does not yet have one-to-one relationships, but the author mentions a temporary workaround: use the has_many macro and define a method that calls the method generated by the macro but returns the first element of the Array it returns (since there can only be one element).
First you need to create models for the two other tables, Model and Make:
class Model < Granite::ORM::Base
  adapter pg
  belongs_to :vehicle

  primary modelid : Int32
  field modelname : String
end

class Make < Granite::ORM::Base
  adapter pg
  belongs_to :vehicle

  primary makeid : Int32
  field makename : String
end
If you have more fields than just modelname or makename, be sure to add those as well.
And lastly, you need to add has_many relationships to the original Vehicle class, and define make and model methods:
class Vehicle < Granite::ORM::Base
  adapter pg

  primary vehicleid : Int32
  field yearid : Int32

  has_many :makes
  has_many :models

  def make
    makes.first["makename"]
  end

  def model
    models.first["modelname"]
  end
end
Querying is then as simple as:
vehicle = Vehicle.find 2
puts vehicle.model
Unfortunately I don't believe that Granite yet supports column aliases (AS) without completely bypassing the ORM, so you have to return those columns explicitly (which the code above does) or access the property directly with vehicle.model["modelname"].
Note: I may have gotten the type of the Hash returned by Granite wrong, since its source code has no type annotations and relies fully on Crystal's type inference, which makes it hard to navigate. I think it is {} of String => DB::Any; if you get a compiler error, try a Symbol instead of a String.

Thanks to @svenskunganka for the idea that sent me down this route. I came up with a solution in the spirit of Granite that stays close to raw SQL and lets the ORM stick to mapping fields to objects.
I added a sql class method to the model definition that behaves almost identically to all but strips away a bit more structure. I also had to add a new query method to the pg adapter in order to support it, but this now works for my use case. Here's the monkey-patched code:
class Granite::Adapter::Pg < Granite::Adapter::Base
  def query(statement = "", params = [] of DB::Any, &block)
    statement = _ensure_clause_template(statement)
    log statement, params
    open do |db|
      db.query statement, params do |rs|
        yield rs
      end
    end
  end
end
module Granite::ORM::Querying
  def sql(clause = "", params = [] of DB::Any)
    rows = [] of self
    @@adapter.query(clause, params) do |results|
      results.each do
        rows << from_sql(results)
      end
    end
    return rows
  end
end
It's a bit ugly (and I'm open to suggestions for cleaning this up), but I can now write the following code:
vehicles = Vehicle.sql("SELECT vehicle.vehicleid, vehicle.yearid, make.makename AS make,
model.modelname AS model FROM vehicle JOIN ...<snip>")
I can then do something like:
vehicles.each do |v|
  puts "#{v.yearid} #{v.make} #{v.model}"
end
And it works as expected.

Related

Django ORM: django aggregate over filtered reverse relation

This question is remotely related to Django ORM: filter primary model based on chronological fields from related model, in that it further limits the resulting queryset.
The models
Assuming we have the following models:
class Patient(models.Model):
    name = models.CharField()
    # other fields following

class MedicalFile(models.Model):
    patient = models.ForeignKey(Patient, related_name='files')
    issuing_date = models.DateField()
    expiring_date = models.DateField()
    diagnostic = models.CharField()
The query
I need to select all the files which are valid at a specified date, most likely from the past. The problem that I have here is that for every patient, there will be a small overlapping period where a patient will have 2 valid files. If we're querying for a date from that small timeframe, I need to select only the most recent file.
More to the point, consider patient John Doe. He will have a string of "uninterrupted" files starting with 2012, like this:
+---+------------+-------------+
|ID |issuing_date|expiring_date|
+---+------------+-------------+
|1 |2012-03-06 |2013-03-06 |
+---+------------+-------------+
|2 |2013-03-04 |2014-03-04 |
+---+------------+-------------+
|3 |2014-03-04 |2015-03-04 |
+---+------------+-------------+
As one can easily observe, there is an overlap of a couple of days in the validity of these files. For instance, on 2013-03-05 files 1 and 2 are both valid, but we consider only file 2 (the most recent one). I'm guessing this use case isn't special: it's the same as managing subscriptions, where you renew early in order to keep a continuous subscription.
Now, in my application I need to query historical data, e.g. give me all the files which were valid on 2013-03-05, considering only the "most recent" ones. I was able to solve this using RawSQL, but I would like a solution without raw SQL. In the previous question, we were able to filter the "latest" file by aggregating over the reverse relation, something like:
qs = MedicalFile.objects.annotate(latest_file_date=Max('patient__files__issuing_date'))
qs = qs.filter(issuing_date=F('latest_file_date')).select_related('patient')
The problem is that we need to limit the range over which latest_file_date is computed, by filtering against 2013-03-05. But aggregate functions don't run over filtered querysets...
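Note: newer Django (2.0+) adds a filter argument to aggregate functions, which is exactly the missing piece here. A sketch of how that could look with the models above, if upgrading is an option:

import datetime

from django.db.models import F, Max, Q

reference_date = datetime.date(2013, 3, 5)

# Annotate each file with its patient's latest issuing_date that is not
# after the reference date, then keep the files matching that date.
qs = MedicalFile.objects.annotate(
    latest_issuing_date=Max('patient__files__issuing_date',
                            filter=Q(patient__files__issuing_date__lte=reference_date)))
qs = qs.filter(expiring_date__gt=reference_date,
               issuing_date=F('latest_issuing_date'))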
The "poor" solution
I'm currently doing this via an extra queryset clause (substitute "app" with your concrete application):
reference_date = datetime.date(year=2013, month=3, day=5)
annotation_latest_issuing_date = {
    'latest_issuing_date': RawSQL('SELECT max(file.issuing_date) '
                                  'FROM <app>_medicalfile file '
                                  'WHERE file.patient_id = <app>_medicalfile.patient_id '
                                  '  AND file.issuing_date <= %s', (reference_date,))
}
qs = MedicalFile.objects.filter(expiring_date__gt=reference_date, issuing_date__lte=reference_date)
qs = qs.annotate(**annotation_latest_issuing_date).filter(issuing_date=F('latest_issuing_date'))
Written as such, the queryset returns the correct number of records.
Question: how can this be achieved without RawSQL and (already implied) with the same level of performance?
You can use id__in and provide your nested filtered queryset (like all files that are valid at the given date).
qs = (MedicalFile.objects
      .filter(id__in=MedicalFile.objects.filter(expiring_date__gt=reference_date,
                                                issuing_date__lte=reference_date))
      .order_by('patient__pk', '-issuing_date')
      .distinct('patient__pk'))  # field_name parameter only supported by Postgres
The order_by groups the files by patient, with the latest issuing date first. distinct then retrieves that first file for each patient. However, general care is required when combining order_by and distinct: https://docs.djangoproject.com/en/1.9/ref/models/querysets/#django.db.models.query.QuerySet.distinct
Edit: Removed single patient dependence from first filter and changed latest to combination of order_by and distinct
Consider that p is a Patient class instance.
I think you can do something like:
p.files.filter(issuing_date__lte='some_date', expiring_date__gt='some_date')
See https://docs.djangoproject.com/en/1.9/topics/db/queries/#backwards-related-objects
Or maybe with the Q magic query object...
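For instance, a sketch of that Q-based validity filter (fields as in the models above; Q mainly pays off once you need OR or negation):

import datetime

from django.db.models import Q

some_date = datetime.date(2013, 3, 5)
# Valid at some_date: already issued, not yet expired.
valid_at = Q(issuing_date__lte=some_date) & Q(expiring_date__gt=some_date)
files = p.files.filter(valid_at)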

Django ORM: Select items where latest status is `success`

I have this model:
class Item(models.Model):
    name = models.CharField(max_length=128)
An Item gets transferred several times. A transfer can be successful or not.
class TransferLog(models.Model):
    item = models.ForeignKey(Item)
    timestamp = models.DateTimeField()
    success = models.BooleanField(default=False)
How can I query for all Items which latest TransferLog was successful?
With "latest" I mean ordered by timestamp.
TransferLog Table
Here is a data sample. Here item1 should not be included, since the last transfer was not successful:
+----+---------+---------------------+---------+
| ID | item_id | timestamp           | success |
+----+---------+---------------------+---------+
| 1  | item1   | 2014-11-15 12:00:00 | False   |
| 2  | item1   | 2014-11-15 14:00:00 | True    |
| 3  | item1   | 2014-11-15 16:00:00 | False   |
+----+---------+---------------------+---------+
I know how to solve this with a loop in python, but I would like to do the query in the database.
An efficient trick is possible if timestamps in the log are increasing, that is, if the end of a transfer is logged as the timestamp (not the start), or if you can at least expect that an older transfer ended before a newer one started. Then you can use the TransferLog row with the highest id instead of the one with the highest timestamp.
from django.db.models import Max

qs = TransferLog.objects.filter(
    id__in=TransferLog.objects.values('item')
                              .annotate(max_id=Max('id'))
                              .values('max_id'),
    success=True)
It groups by item_id in the subquery and sends the highest id of every group to the main query, where it is filtered by the success of the latest row in each group.
You can see that Django compiles it directly into a single, close-to-optimal query.
Verified how it compiles to SQL: print(qs.query.get_compiler('default').as_sql())
SELECT L.id, L.item_id, L.timestamp, L.success FROM app_transferlog L
WHERE L.success = true AND L.id IN
( SELECT MAX(U0.id) AS max_id FROM app_transferlog U0 GROUP BY U0.item_id )
(I edited the example result compiled SQL for better readability by replacing many "app_transferlog"."field" by a short alias L.field, by substituting the True parameter directly into SQL and by editing whitespace and parentheses.)
It can be improved by adding some example filter and by selecting the related Item in the same query:
kwargs = {}  # e.g. filter: kwargs = {'timestamp__gte': ..., 'timestamp__lt': ...}
qs = TransferLog.objects.filter(
    id__in=TransferLog.objects.filter(**kwargs).values('item')
                              .annotate(max_id=Max('id'))
                              .values('max_id'),
    success=True).select_related('item')
Verified how it compiles to SQL: print(qs.query.get_compiler('default').as_sql()[0])
SELECT L.id, L.item_id, L.timestamp, L.success, I.id, I.name
FROM app_transferlog L INNER JOIN app_item I ON ( L.item_id = I.id )
WHERE L.success = %s AND L.id IN
( SELECT MAX(U0.id) AS max_id FROM app_transferlog U0
WHERE U0.timestamp >= %s AND U0.timestamp < %s
GROUP BY U0.item_id )
print(qs.query.get_compiler('default').as_sql()[1])
# result
(True, <timestamp_start>, <timestamp_end>)
Useful fields of the latest TransferLogs and the related Items are acquired in one query:
for logitem in qs:
    item = logitem.item  # the item is still cached in the logitem
    ...
The query can be more optimized according to circumstances, e.g. if you are not interested in the timestamp any more and you work with big data...
Without the assumption of increasing timestamps, it really is more complicated than plain Django ORM. My solutions can be found here.
EDIT after it has been accepted:
An exact solution for a non-increasing dataset is possible with two queries (see the sketch below):
1. Get the set of ids of the last failed transfers. (The fail list is used because it is much smaller than the list of successful transfers.)
2. Iterate over the list of all last transfers, excluding items found in the list of failed transfers.
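A rough sketch of that idea, reusing the Item and TransferLog models from this question (the variable names are mine and the code is untested):

from django.db.models import Max

# Query 1: the latest failure timestamp per item (a small result set).
last_failed = dict(TransferLog.objects.filter(success=False)
                   .values_list('item')
                   .annotate(last_ts=Max('timestamp')))

# Query 2: the latest transfer timestamp per item, regardless of success.
last_any = dict(TransferLog.objects.values_list('item')
                .annotate(last_ts=Max('timestamp')))

# An item's latest transfer succeeded iff its overall latest timestamp is
# newer than its latest failure (or it never failed at all).
ok_item_ids = [item_id for item_id, ts in last_any.items()
               if item_id not in last_failed or ts > last_failed[item_id]]
items = Item.objects.filter(id__in=ok_item_ids)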
In this way, queries that would otherwise require custom SQL like the following can be simulated efficiently:
SELECT a_boolean_field_or_expression,
rank() OVER (PARTITION BY parent_id ORDER BY the_maximized_field DESC)
FROM ...
WHERE rank = 1 GROUP BY parent_object_id
I'm thinking about implementing an aggregation function (e.g. Rank(maximized_field)) as an extension for Django with PostgreSQL, given how useful it would be.
try this
# your query
items_with_good_translogs = Item.objects.filter(
    id__in=(x.item.id for x in TransferLog.objects.filter(success=True)))
Since you said "How can I query for all Items which latest TransferLog was successful?", it is logically easy to follow if you start the query with the Item model.
(x.item.id for x in TransferLog.objects.filter(success=True)) gives a generator of the item ids of all TransferLogs where success is True.
The Q object can also be useful in places like this (negation, OR, ...), as in the sketch below.
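For illustration, a Q-based variant of the same filter might look like this (a sketch; note it selects Items with at least one successful transfer, not Items whose latest transfer succeeded):

from django.db.models import Q

# distinct() avoids duplicate Items caused by the join.
items = Item.objects.filter(Q(transferlog__success=True)).distinct()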
You will probably have an easier time approaching this from the TransferLog side, thusly:
dataset = TransferLog.objects.order_by('item', '-timestamp').distinct('item')
Sadly that does not weed out the False items, and I can't find a way to apply the filter AFTER the distinct. You can, however, filter it after the fact with a Python list comprehension:
dataset = [d.item for d in dataset if d.success]
If you are doing this for log entries within a given time period, it's best to filter that before ordering and distinct-ing:
dataset = TransferLog.objects.filter(
    timestamp__gt=start,
    timestamp__lt=end
).order_by(
    'item', '-timestamp'
).distinct('item')
If you can modify your models, I actually think you'll have an easier time using a ManyToMany relationship instead of an explicit ForeignKey -- Django has some built-in convenience methods that will make your querying easier. Docs on ManyToMany are here. I suggest the following model:
class TransferLog(models.Model):
    item = models.ManyToManyField(Item)
    timestamp = models.DateTimeField()
    success = models.BooleanField(default=False)
Then you could do (I know, not a nice, single-line of code, but I'm trying to be explicit to be clearer):
results = []
for item in Item.objects.all():
    if item.transferlog_set.all().order_by('-timestamp')[0].success:
        results.append(item)
Then your results array will have all the items whose latest transfer was successful. I know, it's still a loop in Python...but perhaps a cleaner loop.

django orm - How to use select_related() on the Foreign Key of a Subclass from its Super Class

I've always found the Django orm's handling of subclassing models to be pretty spiffy. That's probably why I run into problems like this one.
Take three models:
class A(models.Model):
    field1 = models.CharField(max_length=255)

class B(A):
    fk_field = models.ForeignKey('C')

class C(models.Model):
    field2 = models.CharField(max_length=255)
So now you can query the A model and get all the B models, where available:
the_as = A.objects.all()
for a in the_as:
    print a.b.fk_field.field2  # Note that this throws an error if there is no B record
The problem with this is that you are looking at a huge number of database calls to retrieve all of the data.
Now suppose you wanted to retrieve a QuerySet of all A models in the database, but with all of the subclass records and the subclass's foreign key records as well, using select_related() to limit your app to a single database call. You would write a query like this:
the_as = A.objects.select_related("b", "b__fk_field").all()
One query returns all of the data needed! Awesome.
Except not. Because this version of the query is doing its own filtering, even though select_related is not supposed to filter any results at all:
set_1 = A.objects.select_related("b", "b__fk_field").all() #Only returns A objects with associated B objects
set_2 = A.objects.all() #Returns all A objects
len(set_1) > len(set_2) #Will always be False
I used the django-debug-toolbar to inspect the query and found the problem. The generated SQL query uses an INNER JOIN to join the C table to the query, instead of a LEFT OUTER JOIN like other subclassed fields:
SELECT "app_a"."field1", "app_b"."fk_field_id", "app_c"."field2"
FROM "app_a"
LEFT OUTER JOIN "app_b" ON ("app_a"."id" = "app_b"."a_ptr_id")
INNER JOIN "app_c" ON ("app_b"."fk_field_id" = "app_c"."id");
And it seems if I simply change the INNER JOIN to LEFT OUTER JOIN, then I get the records that I want, but that doesn't help me when using Django's ORM.
Is this a bug in select_related() in Django's ORM? Is there any work around for this, or am I simply going to have to do a direct query of the database and map the results myself? Should I be using something like Django-Polymorphic to do this?
It looks like a bug; specifically, it seems to be ignoring the nullable nature of the A->B relationship. If, for example, you had a foreign key reference to B in A instead of the subclassing, that foreign key would of course be nullable, and Django would use a LEFT JOIN for it. You should probably raise this in the Django issue tracker. You could also try using prefetch_related instead of select_related, which might get around your issue.
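A sketch of that last suggestion (untested, and whether prefetch_related can traverse the implicit b link of multi-table inheritance is an assumption on my part):

# Two queries instead of one JOIN; A rows without a B should be kept.
the_as = A.objects.prefetch_related('b__fk_field')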
I found a work around for this, but I will wait a while to accept it in hopes that I can get some better answers.
The INNER JOIN created by select_related('b__fk_field') needs to be removed from the underlying SQL so that the results aren't filtered by the B records in the database. So the new query needs to leave the b__fk_field parameter out of select_related:
the_as = A.objects.select_related('b')
However, this forces us to hit the database every time a C object is accessed from an A object.
for a in the_as:
    # Note that this throws a DoesNotExist error if a doesn't have an
    # associated b
    print a.b.fk_field.field2  # Hits the database every time.
The hack to work around this is to get all of the C objects we need in one query and then have each B object reference them manually. We can do this because the query that retrieves the B objects already includes the fk_field_id referencing each associated C object:
c_ids = [a.b.fk_field_id for a in the_as]  # Get all the C ids
the_cs = C.objects.filter(pk__in=c_ids)  # Run one query to get all of the needed C records
for c in the_cs:
    for a in the_as:
        if a.b.fk_field_id == c.pk:  # Accessing a.b throws DoesNotExist if a has no b
            a.b.fk_field = c
            break
I'm sure there's a functional way to write that without the nested loop, but this illustrates what's happening. It's not ideal, but it provides all of the data with the absolute minimum number of database hits - which is what I wanted.
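For what it's worth, one way to drop the nested loop is a lookup dict keyed by primary key (a sketch under the same assumptions as the code above):

c_by_pk = dict((c.pk, c) for c in the_cs)  # pk -> C object
for a in the_as:
    a.b.fk_field = c_by_pk[a.b.fk_field_id]  # still assumes every a has a b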

Ordering entries via comment count with django

I need to get entries from the database together with their comment counts. Can I do it with Django's comment framework? I am also using a voting application which does not use GenericForeignKeys; I get entries with scores like this:
class EntryManager(models.Manager):
    def get_queryset(self):
        return super(EntryManager, self).get_queryset().annotate(
            score=Sum("linkvote__value"))
But when generic foreign keys are involved, I am stuck. Do you have any ideas about that?
Extra explanation: I need to fetch entries like this:
+----+------+------------+---------------+
| id | body | vote_score | comment_score |
+----+------+------------+---------------+
| 1  | foo  | 13         | 4             |
| 2  | bar  | 4          | 1             |
+----+------+------------+---------------+
After doing that, I can order them via comment_score. :)
Thanks for all replies.
Apparently, annotating with reverse generic relations (or extra filters, in general) is still an open ticket (see also the corresponding documentation). Until this is resolved, I would suggest using raw SQL in an extra query, like this:
return super(EntryManager, self).get_queryset().annotate(
    vote_score=Sum("linkvote__value")).extra(select={
        'comment_score': """SELECT COUNT(*) FROM comments_comment
                            WHERE comments_comment.object_pk = yourapp_entry.id
                            AND comments_comment.content_type_id = %s"""
    }, select_params=(entry_type.id,))
Of course, you have to fill in the correct table names. Furthermore, entry_type is a "constant" that can be set outside your lookup function (see ContentTypeManager):
from django.contrib.contenttypes.models import ContentType
entry_type = ContentType.objects.get_for_model(Entry)
This is assuming you have a single model Entry that you want to calculate your scores on. Otherwise, things would get slightly more complicated: you would need a sub-query to fetch the content type id for the type of each annotated object.
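To close the loop on the ordering part of the question, usage could then look like this (a sketch; extra's order_by parameter can sort on the extra select alias):

# Entries annotated by the manager above, highest comment count first.
entries = Entry.objects.all().extra(order_by=['-comment_score'])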

Left Join with a OneToOne field in Django

I have 2 tables, simpleDB_all and simpleDB_some. The "all" table has an entry for every item I want, while the "some" table has entries only for some items that need additional information. The Django models for these are:
class all(models.Model):
    name = models.CharField(max_length=40)
    important_info = models.CharField(max_length=40)

class some(models.Model):
    all_key = models.OneToOneField(all)
    extra_info = models.CharField(max_length=40)
I'd like to create a view that shows every item in "all" with the extra info if it exists in "some". Since I'm using a 1-1 field I can do this with almost complete success:
allitems = all.objects.all()
for item in allitems:
    print item.name, item.important_info, item.some.extra_info
but when I get to the item that doesn't have a corresponding entry in the "some" table I get a DoesNotExist exception.
Ideally I'd be doing this loop inside a template, so it's impossible to wrap it in a try clause. Any thoughts?
I can get the desired effect directly in SQL using a query like this:
SELECT all.name, all.important_info, some.extra_info
FROM all LEFT JOIN some ON all.id = some.all_key_id;
But I'd rather not use raw SQL.
You won't get a DoesNotExist exception in the template - they are hidden, by design, by the template system.
The SQL you give is what is generated, more or less, when you use select_related on your query (if you're using Django 1.2 or a checkout more recent than r12307, from February):
allitems = all.objects.select_related('some')
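If you do end up iterating in Python rather than in a template, a small sketch of handling the missing row explicitly (accessing item.some raises the some model's DoesNotExist when there is no matching row):

allitems = all.objects.select_related('some')
for item in allitems:
    try:
        extra = item.some.extra_info
    except some.DoesNotExist:
        extra = ''  # no corresponding entry in the "some" table
    print item.name, item.important_info, extra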