Django + PostgreSQL: Fill missing dates in a range

I have a table with one of the columns as date. It can have multiple entries for each date.
date .....
----------- -----
2015-07-20 ..
2015-07-20 ..
2015-07-23 ..
2015-07-24 ..
I would like to get data in the following form using Django ORM with PostgreSQL as database backend:
date count(date)
----------- -----------
2015-07-20 2
2015-07-21 0 (missing after aggregation)
2015-07-22 0 (missing after aggregation)
2015-07-23 1
2015-07-24 1
Corresponding PostgreSQL Query:
WITH RECURSIVE date_view(start_date, end_date)
AS ( VALUES ('2015-07-20'::date, '2015-07-24'::date)
UNION ALL SELECT start_date::date + 1, end_date
FROM date_view
WHERE start_date < end_date )
SELECT start_date, count(date)
FROM date_view LEFT JOIN my_table ON date=start_date
GROUP BY date, start_date
ORDER BY start_date ASC;
I'm having trouble translating this raw query to Django ORM query.
It would be great if someone can give a sample ORM query with/without a workaround for Common Table Expressions using PostgreSQL as database backend.
The simple reason is quoted here:
My preference is to do as much data processing in the database, short of really involved presentation stuff. I don't envy doing this in application code, just as long as it's one trip to the database
As per this answer, Django doesn't support CTEs natively, but the answer seems quite outdated.
References:
MySQL: Select All Dates In a Range Even If No Records Present
WITH Queries (Common Table Expressions)
Thanks

I do not think you can do this with pure Django ORM, and I am not even sure if this can be done neatly with extra(). The Django ORM is incredibly good in handling the usual stuff, but for more complex SQL statements and requirements, more so with DBMS specific implementations, it is just not quite there yet. You might have to go lower and down to executing raw SQL directly, or offload that requirement to be done by the application layer.
You can always generate the missing dates using Python, but that will be incredibly slow if the range and number of elements are huge. If the data is being requested via AJAX for some other use (e.g. charting), then you can offload that to JavaScript.
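For what it's worth, filling the gaps in Python can be sketched roughly as below. This assumes the database has already done the aggregation and handed back (date, count) pairs; the function name fill_missing_dates and the sample rows are made up for illustration.

```python
from datetime import date, timedelta


def fill_missing_dates(aggregated, start, end):
    """Return (day, count) for every day in [start, end], using 0 for gaps."""
    counts = dict(aggregated)  # date -> count, as returned by the aggregation
    days = (start + timedelta(days=i) for i in range((end - start).days + 1))
    return [(day, counts.get(day, 0)) for day in days]


# Hypothetical aggregated rows: only the dates that actually occur in the table.
aggregated = [
    (date(2015, 7, 20), 2),
    (date(2015, 7, 23), 1),
    (date(2015, 7, 24), 1),
]
filled = fill_missing_dates(aggregated, date(2015, 7, 20), date(2015, 7, 24))
for day, count in filled:
    print(day, count)
```

The work is proportional to the number of days in the range, so it only starts to hurt for very long ranges.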

from datetime import date, timedelta
from django.db.models.functions import Trunc
from django.db.models.expressions import Value
from django.db.models import Count, DateField
# A is model
start_date = date(2022, 5, 1)
end_date = date(2022, 5, 10)
result = A.objects\
.annotate(date=Trunc('created', 'day', output_field=DateField())) \
.filter(date__gte=start_date, date__lte=end_date) \
.values('date')\
.annotate(count=Count('id'))\
.union(A.objects.extra(select={
'date': 'unnest(Array[%s]::date[])' %
','.join(map(lambda d: "'%s'::date" % d.strftime('%Y-%m-%d'),
set(start_date + timedelta(n) for n in range((end_date - start_date).days + 1)) -
set(A.objects.annotate(date=Trunc('created', 'day', output_field=DateField())) \
.values_list('date', flat=True))))})\
.annotate(count=Value(0))\
.values('date', 'count'))\
.order_by('date')

Instead of the recursive CTE you could use generate_series() to construct a calendar table:
SELECT calendar, count(mt.zdate) as THE_COUNT
FROM generate_series('2015-07-20'::date
, '2015-07-24'::date
, '1 day'::interval) calendar
LEFT JOIN my_table mt ON mt.zdate = calendar
GROUP BY 1
ORDER BY 1 ASC;
BTW: I renamed date to zdate. DATE is a bad name for a column (it is the name for a data type)
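If you want to try the calendar-join idea without a PostgreSQL server at hand, here is a rough sketch using Python's built-in sqlite3 module. Note the swap: SQLite has no generate_series() by default, so a recursive CTE builds the calendar instead, and the table contents are made-up sample data. On PostgreSQL with Django, the generate_series() query above could presumably be run through django.db.connection.cursor() in much the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (zdate TEXT)")
conn.executemany(
    "INSERT INTO my_table (zdate) VALUES (?)",
    [("2015-07-20",), ("2015-07-20",), ("2015-07-23",), ("2015-07-24",)],
)

# The recursive CTE plays the role of generate_series(): one row per day.
sql = """
WITH RECURSIVE calendar(day) AS (
    SELECT ?
    UNION ALL
    SELECT date(day, '+1 day') FROM calendar WHERE day < ?
)
SELECT calendar.day, count(mt.zdate)
FROM calendar
LEFT JOIN my_table mt ON mt.zdate = calendar.day
GROUP BY calendar.day
ORDER BY calendar.day
"""
counts = conn.execute(sql, ("2015-07-20", "2015-07-24")).fetchall()
for day, total in counts:
    print(day, total)
```

Counting mt.zdate rather than * is what makes the missing days come out as 0: the LEFT JOIN produces a NULL for them and count() skips NULLs.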

Related

How to enforce Django to use "JOIN VALUES"

I'm having a performance problem where I need to replace a section of my query statement. Right now I have the following:
select count(*) FROM "mytable" WHERE "field" IN ('v1', 'v2', ..., 'vN');
this can be translated to Django ORM:
Mytable.objects.all().filter(field__in=[myvalues]).count()
I need to do the following though:
select count(*) FROM "mytable" JOIN (values ('v1', 'v2', ..., 'vN')) as lookup(value) on lookup.value = "mytable".field;
Is there a way to add this to the ORM? I need to do with ORM because I already have other filters. Worst case scenario I thought of getting the query string and adding there manually...
I'm using Postgresql 9.6
I found a way after reading the documentation over and over. I even found a patch, never merged, from a while ago.
It doesn't really do the join, but it works much faster than using __in directly.
What I'm doing is wrapping the query in a RawSQL() expression (available since Django 1.8) and then passing that result to __in again.
So here is a code example:
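from django.db.models.expressions import RawSQL

# Subquery that joins the table against the literal list of values.
query = """select myfield from mytable join (values
('v1'), ('v2'), ..., ('vN')
) as lookup(value) on lookup.value = mytable.myfield"""
r = RawSQL(query, [])
# Feed the subquery to __in instead of passing the values list directly.
Mytable.objects.filter(myfield__in=r)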
Now it takes milliseconds instead of minutes!
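One thing to be careful about: pasting the values straight into the SQL string is risky if they come from user input. A minimal sketch of building the same statement with placeholders is below. The helper name build_values_lookup is made up, and the table and column names are assumed to be trusted identifiers, since they are interpolated as-is.

```python
def build_values_lookup(values, table="mytable", column="myfield"):
    """Return (sql, params) for a JOIN (VALUES ...) lookup using placeholders."""
    if not values:
        raise ValueError("VALUES needs at least one row")
    # One "(%s)" row per value; the database driver fills them in from params.
    rows = ", ".join(["(%s)"] * len(values))
    sql = (
        f"select {column} from {table} "
        f"join (values {rows}) as lookup(value) "
        f"on lookup.value = {table}.{column}"
    )
    return sql, list(values)


sql, params = build_values_lookup(["v1", "v2", "v3"])
print(sql)
print(params)
# With Django, this pair could then be handed over as RawSQL(sql, params).
```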

Analyzing Twitter with Hive, regexp_extract

I am trying to find the most popular hashtags of July. So far I am able to select tweets from July, or display the most popular tweets, but I didn't succeed in putting the two together. I am thinking about creating an intermediate table with the July tweets and then displaying the popular hashtags, but I don't know how. Can you help me? What about a two-level select (select a from (select b from table))?
SELECT hashtags.text, count(*) as total FROM tweets
WHERE regexp_extract(created_at, "(Tue) (Jul)*", 2) = "Jul"
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text), created_at
ORDER BY total_count DESC
LIMIT 200
Regards, K.
So far I did this, which is pretty much what I want, but is there any other way to achieve it?
Working nested query:
SELECT
LOWER(hashtags.text),
COUNT(*) AS total_count
FROM (
SELECT * FROM tweets WHERE regexp_extract(created_at,"(Tue Jul)*",1) = "Tue Jul"
) tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15
EDIT:
Ok, so if you want you can also do it by a temporary table:
CREATE TABLE tmpdb (
id BIGINT,
created_at STRING,
source STRING,
favorited BOOLEAN,
retweet_count INT,
retweeted_status STRUCT<
text:STRING,
user:STRUCT<screen_name:STRING,name:STRING>>,
entities STRUCT<
urls:ARRAY<STRUCT<expanded_url:STRING>>,
user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
user STRUCT<
screen_name:STRING,
name:STRING,
friends_count:INT,
followers_count:INT,
statuses_count:INT,
verified:BOOLEAN,
utc_offset:INT,
time_zone:STRING>,
in_reply_to_screen_name STRING
)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
Then you update it:
INSERT OVERWRITE TABLE tmpdb
SELECT * FROM tweets WHERE regexp_extract(created_at,"(Tue Jul)*",1) = "Tue Jul"
And the request becomes as simple as this:
SELECT
LOWER(hashtags.text),
COUNT(*) AS total_count
FROM tmpdb
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15
Pros and cons of the second method:
You need to refresh the table whenever you want accurate results, so it is not suited to a one-shot request. If you need to run several requests against the same snapshot of the data, however, this method is better.
Don't forget that copying the data is a costly operation! So know when to use it :)
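To sanity-check the Hive results on a small sample, the same aggregation can be sketched in plain Python. This is only an illustration: the sample tweets are invented, and created_at is assumed to follow the usual Twitter layout ("Tue Jul 21 10:15:00 +0000 2015"). Unlike the "(Tue Jul)" pattern above, it matches on the month field alone, so every weekday in July is counted.

```python
from collections import Counter


def top_hashtags(tweets, month="Jul", limit=15):
    """Count lower-cased hashtags of the tweets created in the given month."""
    counts = Counter()
    for tweet in tweets:
        # Second whitespace-separated field of created_at is the month name.
        parts = tweet["created_at"].split()
        if len(parts) < 2 or parts[1] != month:
            continue
        for tag in tweet["entities"]["hashtags"]:
            counts[tag["text"].lower()] += 1
    return counts.most_common(limit)


# Hypothetical sample in the shape of the tweets table.
sample = [
    {"created_at": "Tue Jul 21 10:15:00 +0000 2015",
     "entities": {"hashtags": [{"text": "Hive"}, {"text": "BigData"}]}},
    {"created_at": "Wed Jul 22 08:00:00 +0000 2015",
     "entities": {"hashtags": [{"text": "hive"}]}},
    {"created_at": "Sat Aug 01 09:30:00 +0000 2015",
     "entities": {"hashtags": [{"text": "hive"}]}},
]
result = top_hashtags(sample)
print(result)
```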

How to use subquery in django?

I want to get a list of the latest purchase of each customer, sorted by the date.
The following query does what I want except for the date:
(Purchase.objects
.all()
.distinct('customer')
.order_by('customer', '-date'))
It produces a query like:
SELECT DISTINCT ON
"shop_purchase.customer_id"
"shop_purchase.id"
"shop_purchase.date"
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC,
"shop_purchase.date" DESC;
I am forced to use customer_id as the first ORDER BY expression because of DISTINCT ON.
I want to sort by the date, so what the query I really need should look like this:
SELECT * FROM (
SELECT DISTINCT ON
"shop_purchase.customer_id"
"shop_purchase.id"
"shop_purchase.date"
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC,
"shop_purchase.date" DESC;
)
AS result
ORDER BY date DESC;
I don't want to sort using python because I still got to page limit the query. There can be tens of thousands of rows in the database.
In fact it is currently sorted by in python now and is causing very long page load times, so that's why I'm trying to fix this.
Basically I want something like this https://stackoverflow.com/a/9796104/242969. Is it possible to express it with django querysets instead of writing raw SQL?
The actual models and methods are several pages long, but here is the set of models required for the queryset above.
class Customer(models.Model):
user = models.OneToOneField(User)
class Purchase(models.Model):
customer = models.ForeignKey(Customer)
date = models.DateField(auto_now_add=True)
item = models.CharField(max_length=255)
If I have data like:
Customer A -
Purchase(item=Chair, date=January),
Purchase(item=Table, date=February)
Customer B -
Purchase(item=Speakers, date=January),
Purchase(item=Monitor, date=May)
Customer C -
Purchase(item=Laptop, date=March),
Purchase(item=Printer, date=April)
I want to be able to extract the following:
Purchase(item=Monitor, date=May)
Purchase(item=Printer, date=April)
Purchase(item=Table, date=February)
There is at most one purchase in the list per customer. The purchase is each customer's latest. It is sorted by latest date.
This query will be able to extract that:
SELECT * FROM (
SELECT DISTINCT ON
"shop_purchase.customer_id"
"shop_purchase.id"
"shop_purchase.date"
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC,
"shop_purchase.date" DESC;
)
AS result
ORDER BY date DESC;
I'm trying to find a way not to have to use raw SQL to achieve this result.
This may not be exactly what you're looking for, but it might get you closer. Take a look at Django's annotate.
Here is an example of something that may help:
from django.db.models import Max
Customer.objects.all().annotate(most_recent_purchase=Max('purchase__date'))
This will give you a list of your customer models each one of which will have a new attribute called "most_recent_purchase" and will contain the date on which they made their last purchase. The sql produced looks like this:
SELECT "demo_customer"."id",
"demo_customer"."user_id",
MAX("demo_purchase"."date") AS "most_recent_purchase"
FROM "demo_customer"
LEFT OUTER JOIN "demo_purchase" ON ("demo_customer"."id" = "demo_purchase"."customer_id")
GROUP BY "demo_customer"."id",
"demo_customer"."user_id"
Another option, would be adding a property to your customer model that would look something like this:
@property
def latest_purchase(self):
return self.purchase_set.order_by('-date')[0]
You would obviously need to handle the case where there aren't any purchases in this property, and this would potentially not perform very well (since you would be running one query for each customer to get their latest purchase).
I've used both of these techniques in the past and they've both worked fine in different situations. I hope this helps. Best of luck!
Whenever there is a difficult query to write using Django ORM, I first try the query in psql(or whatever client you use). The SQL that you want is not this:
SELECT * FROM (
SELECT DISTINCT ON
"shop_purchase.customer_id" "shop_purchase.id" "shop_purchase.date"
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC, "shop_purchase.date" DESC;
) AS result
ORDER BY date DESC;
In the above SQL, the inner SQL is looking for distinct on a combination of (customer_id, id, and date) and since id will be unique for all, you will get all records from the table. I am assuming id is the primary key as per convention.
If you need to find the last purchase of every customer, you need to do something like:
SELECT "shop_purchase"."customer_id", max("shop_purchase"."date")
FROM shop_purchase
GROUP BY 1
But the problem with the above query is that it gives you only the customer id and the date, which is not enough to identify the purchase records when you use these results in a subquery.
To use IN you need a list of unique parameters to identify a record, e.g., id
If in your records id is a serial key, then you can leverage the fact that the latest date will be the maximum id as well. So your SQL becomes:
SELECT max("shop_purchase"."id")
FROM shop_purchase
GROUP BY "shop_purchase"."customer_id";
Note that I kept only one field (id) in the selected clause to use it in a subquery using IN.
The complete SQL will now be:
SELECT *
FROM shop_purchase
WHERE "shop_purchase"."id" IN
(SELECT max("shop_purchase"."id")
FROM shop_purchase
GROUP BY "shop_purchase"."customer_id");
and using the Django ORM it looks like:
(Purchase.objects.filter(
id__in=Purchase.objects
.values('customer_id')
.annotate(latest=Max('id'))
.values_list('latest', flat=True)))
Hope it helps!
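To check the max(id) idea against the sample data from the question, here is a small sketch using Python's built-in sqlite3 module instead of PostgreSQL. The concrete dates are invented to match the months in the question, and ids are assigned in insertion order, so the assumption that the highest id is the latest purchase holds here. The ORDER BY date DESC is added to get the sorting the question asks for; on the Django side, appending .order_by('-date') to the queryset above should do the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shop_purchase ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT, date TEXT)"
)
conn.executemany(
    "INSERT INTO shop_purchase (customer_id, item, date) VALUES (?, ?, ?)",
    [
        (1, "Chair", "2015-01-10"), (1, "Table", "2015-02-10"),
        (2, "Speakers", "2015-01-15"), (2, "Monitor", "2015-05-01"),
        (3, "Laptop", "2015-03-05"), (3, "Printer", "2015-04-20"),
    ],
)

# Latest purchase per customer (highest id), newest first.
latest = conn.execute(
    """
    SELECT item, date FROM shop_purchase
    WHERE id IN (SELECT max(id) FROM shop_purchase GROUP BY customer_id)
    ORDER BY date DESC
    """
).fetchall()
for item, day in latest:
    print(item, day)
```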
I have a similar situation and this is how I'm planning to go about it:
query = Purchase.objects.distinct('customer').order_by('customer').query
query = 'SELECT * FROM ({}) AS result ORDER BY sent DESC'.format(query)
return Purchase.objects.raw(query)
Upside it gives me the query I want. Downside is that it is raw query and I can't append any other queryset filters.
This is my approach when I need a subset of data (N items) along with the Django query. This is an example using PostgreSQL and the handy json_build_object() function (Postgres 9.4+), but in the same way you can use another aggregate function in another database system. For older PostgreSQL versions you can use a combination of the array_agg() and array_to_string() functions.
Imagine you have Article and Comment models and along with every article in the list you want to select 3 recent comments (change LIMIT 3 to adjust size of subset or ORDER BY c.id DESC to change sorting of subset).
qs = Article.objects.all()
qs = qs.extra(select = {
'recent_comments': """
SELECT
json_build_object('comments',
array_agg(
json_build_object('id', id, 'user_id', user_id, 'body', body)
)
)
FROM (
SELECT
c.id,
c.user_id,
c.body
FROM app_comment c
WHERE c.article_id = app_article.id
ORDER BY c.id DESC
LIMIT 3
) sub
"""
})
for article in qs:
print(article.recent_comments)
# Output:
# {u'comments': [{u'user_id': 1, u'id': 3, u'body': u'foo'}, {u'user_id': 1, u'id': 2, u'body': u'bar'}, {u'user_id': 1, u'id': 1, u'body': u'joe'}]}
# ....

Django: Distinct foreign keys

class Log:
project = ForeignKey(Project)
msg = CharField(...)
date = DateField(...)
I want to select the four most recent Log entries where each Log entry must have a unique project foreign key. I've tried the solutions from a Google search but none of them works, and the Django documentation isn't very good for this kind of lookup.
I tried stuff like:
Log.objects.all().distinct('project')[:4]
Log.objects.values('project').distinct()[:4]
Log.objects.values_list('project').distinct('project')[:4]
But this either returns nothing or Log entries of the same project.
Any help would be appreciated!
Queries don't work like that - either in Django's ORM or in the underlying SQL. If you want to get unique IDs, you can only query for the ID. So you'll need to do two queries to get the actual Log entries. Something like:
id_list = Log.objects.order_by('-date').values_list('project_id').distinct()[:4]
entries = Log.objects.filter(id__in=id_list)
Actually, you can get the project_ids in SQL. Assuming that you want the unique project ids for the four projects with the latest log entries, the SQL would look like this:
SELECT project_id, max(logs.date) as max_date
FROM logs
GROUP BY project_id
ORDER BY max_date DESC LIMIT 4;
Now, you actually want all of the log information. In PostgreSQL 8.4 and later you can use windowing functions, but that doesn't work on other versions/databases, so I'll do it the more complex way:
SELECT logs.*
FROM logs JOIN (
SELECT project_id, max(logs.date) as max_date
FROM logs
GROUP BY project_id
ORDER BY max_date DESC LIMIT 4 ) as latest
ON logs.project_id = latest.project_id
AND logs.date = latest.max_date;
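A quick way to convince yourself that this join returns one entry for each of the four most recently active projects is to run it on a toy table. The sketch below uses Python's built-in sqlite3 module rather than PostgreSQL, with invented rows; an outer ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE logs ("
    "id INTEGER PRIMARY KEY, project_id INTEGER, msg TEXT, date TEXT)"
)
conn.executemany(
    "INSERT INTO logs (project_id, msg, date) VALUES (?, ?, ?)",
    [
        (1, "p1 old", "2015-01-01"), (1, "p1 new", "2015-06-01"),
        (2, "p2 new", "2015-05-01"), (3, "p3 new", "2015-04-01"),
        (4, "p4 old", "2015-02-01"), (4, "p4 new", "2015-03-01"),
        (5, "p5 new", "2015-01-15"),
    ],
)

# Inner query: latest date per project, top four; outer query: the full rows.
recent = conn.execute(
    """
    SELECT logs.project_id, logs.msg
    FROM logs JOIN (
        SELECT project_id, max(date) AS max_date
        FROM logs
        GROUP BY project_id
        ORDER BY max_date DESC LIMIT 4
    ) AS latest
    ON logs.project_id = latest.project_id
    AND logs.date = latest.max_date
    ORDER BY logs.date DESC
    """
).fetchall()
for project_id, msg in recent:
    print(project_id, msg)
```

One caveat: if a project has two entries on the same latest date, the join returns both of them.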
Now, if you have access to windowing functions, it's a bit neater (I think anyway), and certainly faster to execute:
SELECT * FROM (
SELECT logs.field1, logs.field2, logs.field3, logs.date,
rank() over ( partition by project_id
order by "date" DESC ) as dateorder
FROM logs ) as logsort
WHERE dateorder = 1
ORDER BY logsort.date DESC LIMIT 4;
OK, maybe it's not easier to understand, but take my word for it, it runs worlds faster on a large database.
I'm not entirely sure how that translates to object syntax, though, or even if it does. Also, if you wanted to get other project data, you'd need to join against the projects table.
I know this is an old post, but in Django 2.0, I think you could just use:
Log.objects.values('project').distinct().order_by('project')[:4]
You need two querysets. The good thing is it still results in a single trip to the database (though there is a subquery involved).
latest_ids_per_project = Log.objects.values_list(
'project').annotate(latest=Max('date')).order_by(
'-latest').values_list('project')
log_objects = Log.objects.filter(
id__in=latest_ids_per_project[:4]).order_by('-date')
This looks a bit convoluted, but it actually results in a surprisingly compact query:
SELECT "log"."id",
"log"."project_id",
"log"."msg",
"log"."date"
FROM "log"
WHERE "log"."id" IN
(SELECT U0."id"
FROM "log" U0
GROUP BY U0."project_id"
ORDER BY MAX(U0."date") DESC
LIMIT 4)
ORDER BY "log"."date" DESC

postgresql full text search query to django ORM

I was following the documentation on FullTextSearch in postgresql. I've created a tsvector column and added the information i needed, and finally i've created an index.
Now, to do the search i have to execute a query like this
SELECT *, ts_rank_cd(textsearchable_index_col, query) AS rank
FROM client, plainto_tsquery('famille age') query
WHERE textsearchable_index_col @@ query
ORDER BY rank DESC LIMIT 10;
I want to be able to execute this with Django's ORM so i could get the objects. (A little question here: do i need to add the tsvector column to my model?)
My guess is that i should use extra() to change the "where" and "tables" in the queryset
Maybe if i change the query to this, it would be easier:
SELECT * FROM client
WHERE plainto_tsquery('famille age') @@ textsearchable_index_col
ORDER BY ts_rank_cd(textsearchable_index_col, plainto_tsquery(text_search)) DESC LIMIT 10
so I'd have to do something like:
Client.objects.???.extra(where=[???])
Thxs for your help :)
Another thing, i'm using Django 1.1
Caveat: I'm writing this on a wobbly train, with a headcold, but this should do the trick:
where_statement = """plainto_tsquery(%s) @@ textsearchable_index_col
ORDER BY ts_rank_cd(textsearchable_index_col,
plainto_tsquery(%s))
DESC LIMIT 10"""
qs = Client.objects.extra(where=[where_statement],
params=['famille age', 'famille age'])
If you were on Django 1.2 you could just call:
Client.objects.raw("""
SELECT *, ts_rank_cd(textsearchable_index_col, query) AS rank
FROM client, plainto_tsquery('famille age') query
WHERE textsearchable_index_col @@ query
ORDER BY rank DESC LIMIT 10;""")
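Since the search terms will normally come from the user, it is probably safer not to hard-code them in the SQL. A minimal sketch of building the same statement with placeholders follows. The helper name build_search_query is made up for illustration; the column and table names are the ones from the question.

```python
def build_search_query(terms, limit=10):
    """Return (sql, params) for the ranked full-text query, with placeholders."""
    sql = (
        "SELECT *, ts_rank_cd(textsearchable_index_col, query) AS rank "
        "FROM client, plainto_tsquery(%s) query "
        "WHERE textsearchable_index_col @@ query "
        "ORDER BY rank DESC LIMIT %s"
    )
    # The driver substitutes the placeholders, so the terms are never inlined.
    return sql, [terms, int(limit)]


sql, params = build_search_query("famille age")
print(sql)
print(params)
# With Django 1.2 or later this pair could be used as Client.objects.raw(sql, params).
```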