Query efficiency - django

This seemingly simple query (1 join) takes many hours to run even though the table containes less than 1.5 million rows...
I have Product items which have a one-to-many relationship with RetailerProduct items and I would like to find all Product items whose related RetailerProducts do not contain any instances of retailer_id=1.
There are about 1.5 million rows in Product, and about 1.1 million rows in RetailerProduct with retailer_id=1 (2.9 million in total in RetailerProduct)
Models:
class Product(models.Model):
...
upc = models.CharField(max_length=96, unique=True)
...
class RetailerProduct(models.Model):
...
product = models.ForeignKey('project.Product',
related_name='retailer_offerings',
on_delete=models.CASCADE,
null=True)
...
class Meta:
unique_together = (("retailer", "retailer_product_id", "retailer_sku"),)
Query:
Product.objects.exclude(
retailer_offerings__retailer_id=1).values_list('upc', flat=True)
Generated SQL:
SELECT "project_product"."upc" FROM "project_product"
WHERE NOT ("project_product"."id" IN
(SELECT U1."product_id" AS Col1 FROM "project_retailerproduct" U1
WHERE (U1."retailer_id" = 1 AND U1."product_id" IS NOT NULL))
)
Running that query takes hours.
An EXPLAIN in the psql shell renders:
QUERY PLAN
---------------------------------------------------------------------------------------------------
Seq Scan on project_product (cost=0.00..287784596160.17 rows=725892 width=13)
Filter: (NOT (SubPlan 1))
SubPlan 1
-> Materialize (cost=0.00..393961.19 rows=998211 width=4)
-> Seq Scan on project_retailerproduct u1 (cost=0.00..385070.14 rows=998211 width=4)
Filter: ((product_id IS NOT NULL) AND (retailer_id = 1))
(6 rows)
I wanted to post the EXPLAIN ANALYZE but it's still running.
Why is the cost so high for Seq Scan on project_product? Any optimization suggestions?

1.1 million rows in RetailerProduct with retailer_id=1 (2.9 million in total in RetailerProduct)
You are selecting 1.1 million rows out of 2.9 million rows. Even if you had an index on retailer_id it wouldn't be much use here. You are looking at almost half the table here. This will need a full table scan.
Then let us recall that WHERE NOT IN type queries are generally slow. In your case you are comparing the product_id column against 1.1 million rows. Having done that you are actually fetching the rows, which probably amounts to several hundred thousand rows. You might want to consider a LIMIT but even then the query probably wouldn't be a whole lot faster.
So this is not a query that can easily be optimized. You might want to use a completely different query. Here is an example raw query
SELECT "project_product"."upc" FROM "project_product" LEFT JOIN
(SELECT product_id FROM "project_retailerproduct"
WHERE retailer_id = 1)
AS retailer
ON project_product.id = retailer.product_id WHERE
WHERE retailer.product_id IS NULL

Related

Power BI - How can I create a measure aggregation from two tables?

I've got two tables Fauna_Afetada (animals affected) e Municipios (all cities of the country). A fauna afetada tem a sigla_uf (state initials which the city belongs to) e nome_municipio (name of the city), as well as the Municipios table.
I'd like to create a measure counting the number of deaths (Situação_Int = 0 in table Fauna_Afetada, specify that the animal died) by city (table Municipios). The way to join these tables is by using two columns (the sigla_uf and nome_municipio), because each Municipio in Fauna_Afetada might appear several times. How do I do that?
I'd like to create a measure with the name of the city with the highest number of deaths.
Anyone could help?
The data model is below.
It seems it is a data modeling issue in the first galce,
The model is full of bidirectional relationships that lead to not expect calculations behaviors.
Can you share the file or part of it

Power BI - totals per related entity

Basically, I’d like to get one entity totals, but calculated for another (but still related/associated!) entity. Relation type between these entities is many-to-many.
Just to be less abstract, let’s take Trips and Shipments as mentioned entities and Shipments’ weight as a total to be calculated.
Calculating weight totals just per each trip is pretty easy task. Here is a table of Shipments weights:
We place them into some amounts of trucks/trips and get following weight totals per trip:
But when I try to show SUM of Trip weight totals (figures from 2nd table) per each related Shipment (Column from 1st table), it becomes much harder than I expect.
It should look like:
And I can’t get such table within Power BI.
Data model for your reference:
Seems like SUMMARIZE function is almost fit, but it doesn’t allow me to use a column from another table than initialized in the function:
Additional restrictions:
Selections made by user (clicks on cells etc.) should not affect calculation anyhow.
The figures should be able to be used in further calculations, using them as a basis.
Can anyone advise a solution? Or at least proper DAX references to consider? I thought I could find a quick answer in DAX reference guide on my own. However I failed to find a quick answer.
Version 1
Try the following DAX function as a calculated column inside of your shipments table:
TripWeight =
VAR tripID =
RELATED ( Trips[TripID] )
RETURN
CALCULATE (
SUM ( Shipments[ShipmentTaxWeightKG] );
FILTER ( Shipments; RELATED ( InkTable[TripID] ) = tripID )
)
The first expression var tripID is storing the TripID of the current row and the CALCULATE function gets the SUM of all of the weight for all the shipments that belong to the current trip.
Version 2
You can also create a calculated table using the following DAX and create a relationship between the newly created table and your Trips table and simply display the weight in this table:
TripWeight =
GROUPBY (
Shipments;
Trips[TripID];
"Total Weight KG"; SUMX ( CURRENTGROUP (); Shipments[ShipmentTaxWeightKG] )
)
Version 3
Version 1 and 2 are only working if the relationship between lnkTrip and Shipment is a One-to-One relationship. If it is a many-to-one relationship, the following calculated column can be created inside of the Trips table:
ShipmentTaxWeightKG by Trip = SUMX(RELATEDTABLE(Shipments); Shipments[ShipmentTaxWeightKG])
hope this helps.

Uncached Lookup Performance Issue

I have inventory_history table that contains 2 millions of data, on which I am performing uncached lookup.
From source table, I am retrieving last 3 months data, which is around 300 thousand rows.
My mapping contains single lookup and is uncached (inventory_history). Lookup overide is used to retrieve data from inventory_history table, the condition columns are indexed and not using any unwanted columns.
But I see the t/m busy percentage is 100% and is like below 100. Lookup override query is executing well in database. This mapping is taking forever time. How can I tune the performance.
Don't know where the problem exists... Any suggestions ?
SELECT
SUM(CASE WHEN UPPER(GM) = 'B' and UNITS> 100 THEN A.QTY/B.UNITS ELSE QTY END) AS QTY,
A.TDATE as TDATE,
A.TDATE_ID as TDATE_ID,
A.DIST_ID as DIST_ID,
A.PRODID as PROD_ID
FROM
HUSA_ODS.INVENTORY_HISTORY A,
HUSA_ODS.PRODUCT B
WHERE
A.PROD_ID = B.PROD_ID
AND TCODE = '10' AND
DISTID = ?DISTID_IN?
AND A.PROD_ID = ?PROD_ID_IN?
AND TDATE <= ?PERIOD_DATE_IN?
GROUP BY
TDATE,
TDATE_ID,
DIST_ID,
A.PROD_ID
ORDER BY
TDATE DESC,
DIST_ID,
A.PROD_ID , TDATE--
Here output columns are QTY and TDATE
Uncached lookup would hit the database for each of your 300 thousand rows coming from source. Use a cached lookup and see if you can filter out some of the data in your lookup query.

How to use subquery in django?

I want to get a list of the latest purchase of each customer, sorted by the date.
The following query does what I want except for the date:
(Purchase.objects
.all()
.distinct('customer')
.order_by('customer', '-date'))
It produces a query like:
SELECT DISTINCT ON
"shop_purchase.customer_id"
"shop_purchase.id"
"shop_purchase.date"
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC,
"shop_purchase.date" DESC;
I am forced to use customer_id as the first ORDER BY expression because of DISTINCT ON.
I want to sort by the date, so what the query I really need should look like this:
SELECT * FROM (
SELECT DISTINCT ON
"shop_purchase.customer_id"
"shop_purchase.id"
"shop_purchase.date"
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC,
"shop_purchase.date" DESC;
)
AS result
ORDER BY date DESC;
I don't want to sort using python because I still got to page limit the query. There can be tens of thousands of rows in the database.
In fact it is currently sorted by in python now and is causing very long page load times, so that's why I'm trying to fix this.
Basically I want something like this https://stackoverflow.com/a/9796104/242969. Is it possible to express it with django querysets instead of writing raw SQL?
The actual models and methods are several pages long, but here is the set of models required for the queryset above.
class Customer(models.Model):
user = models.OneToOneField(User)
class Purchase(models.Model):
customer = models.ForeignKey(Customer)
date = models.DateField(auto_now_add=True)
item = models.CharField(max_length=255)
If I have data like:
Customer A -
Purchase(item=Chair, date=January),
Purchase(item=Table, date=February)
Customer B -
Purchase(item=Speakers, date=January),
Purchase(item=Monitor, date=May)
Customer C -
Purchase(item=Laptop, date=March),
Purchase(item=Printer, date=April)
I want to be able to extract the following:
Purchase(item=Monitor, date=May)
Purchase(item=Printer, date=April)
Purchase(item=Table, date=February)
There is at most one purchase in the list per customer. The purchase is each customer's latest. It is sorted by latest date.
This query will be able to extract that:
SELECT * FROM (
SELECT DISTINCT ON
"shop_purchase.customer_id"
"shop_purchase.id"
"shop_purchase.date"
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC,
"shop_purchase.date" DESC;
)
AS result
ORDER BY date DESC;
I'm trying to find a way not to have to use raw SQL to achieve this result.
This may not be exactly what you're looking for, but it might get you closer. Take a look at Django's annotate.
Here is an example of something that may help:
from django.db.models import Max
Customer.objects.all().annotate(most_recent_purchase=Max('purchase__date'))
This will give you a list of your customer models each one of which will have a new attribute called "most_recent_purchase" and will contain the date on which they made their last purchase. The sql produced looks like this:
SELECT "demo_customer"."id",
"demo_customer"."user_id",
MAX("demo_purchase"."date") AS "most_recent_purchase"
FROM "demo_customer"
LEFT OUTER JOIN "demo_purchase" ON ("demo_customer"."id" = "demo_purchase"."customer_id")
GROUP BY "demo_customer"."id",
"demo_customer"."user_id"
Another option, would be adding a property to your customer model that would look something like this:
#property
def latest_purchase(self):
return self.purchase_set.order_by('-date')[0]
You would obviously need to handle the case where there aren't any purchases in this property, and this would potentially not perform very well (since you would be running one query for each customer to get their latest purchase).
I've used both of these techniques in the past and they've both worked fine in different situations. I hope this helps. Best of luck!
Whenever there is a difficult query to write using Django ORM, I first try the query in psql(or whatever client you use). The SQL that you want is not this:
SELECT * FROM (
SELECT DISTINCT ON
"shop_purchase.customer_id" "shop_purchase.id" "shop_purchase.date"
FROM "shop_purchase"
ORDER BY "shop_purchase.customer_id" ASC, "shop_purchase.date" DESC;
) AS result
ORDER BY date DESC;
In the above SQL, the inner SQL is looking for distinct on a combination of (customer_id, id, and date) and since id will be unique for all, you will get all records from the table. I am assuming id is the primary key as per convention.
If you need to find the last purchase of every customer, you need to do something like:
SELECT "shop_purchase.customer_id", max("shop_purchase.date")
FROM shop_purchase
GROUP BY 1
But the problem with the above query is that it will give you only the customer name and date. Using that will not help you in finding the records when you use these results in a subquery.
To use IN you need a list of unique parameters to identify a record, e.g., id
If in your records id is a serial key, then you can leverage the fact that the latest date will be the maximum id as well. So your SQL becomes:
SELECT max("shop_purchase.id")
FROM shop_purchase
GROUP BY "shop_purchase.customer_id";
Note that I kept only one field (id) in the selected clause to use it in a subquery using IN.
The complete SQL will now be:
SELECT *
FROM shop_customer
WHERE "shop_customer.id" IN
(SELECT max("shop_purchase.id")
FROM shop_purchase
GROUP BY "shop_purchase.customer_id");
and using the Django ORM it looks like:
(Purchase.objects.filter(
id__in=Purchase.objects
.values('customer_id')
.annotate(latest=Max('id'))
.values_list('latest', flat=True)))
Hope it helps!
I have a similar situation and this is how I'm planning to go about it:
query = Purchase.objects.distinct('customer').order_by('customer').query
query = 'SELECT * FROM ({}) AS result ORDER BY sent DESC'.format(query)
return Purchase.objects.raw(query)
Upside it gives me the query I want. Downside is that it is raw query and I can't append any other queryset filters.
This is my approach if I need some subset of data (N items) along with the Django query. This is example using PostgreSQL and handy json_build_object() function (Postgres 9.4+), but same way you can use other aggregate function in other database system. For older PostgreSQL versions you can use combination of array_agg() and array_to_string() functions.
Imagine you have Article and Comment models and along with every article in the list you want to select 3 recent comments (change LIMIT 3 to adjust size of subset or ORDER BY c.id DESC to change sorting of subset).
qs = Article.objects.all()
qs = qs.extra(select = {
'recent_comments': """
SELECT
json_build_object('comments',
array_agg(
json_build_object('id', id, 'user_id', user_id, 'body', body)
)
)
FROM (
SELECT
c.id,
c.user_id,
c.body
FROM app_comment c
WHERE c.article_id = app_article.id
ORDER BY c.id DESC
LIMIT 3
) sub
"""
})
for article in qs:
print(article.recent_comments)
# Output:
# {u'comments': [{u'user_id': 1, u'id': 3, u'body': u'foo'}, {u'user_id': 1, u'id': 2, u'body': u'bar'}, {u'user_id': 1, u'id': 1, u'body': u'joe'}]}
# ....

Slow Postgres JOIN Query

I'm trying to optimize a slow query that was generated by the Django ORM. It is a many-to-many query. It takes over 1 min to run.
The tables have a good amount of data, but they aren't huge (400k rows in sp_article and 300k rows in sp_article_categories)
#categories.article_set.filter(post_count__lte=50)
EXPLAIN ANALYZE SELECT *
FROM "sp_article"
INNER JOIN "sp_article_categories" ON ("sp_article"."id" = "sp_article_categories"."article_id")
WHERE ("sp_article_categories"."category_id" = 1081
AND "sp_article"."post_count" <= 50 )
Nested Loop (cost=0.00..6029.01 rows=656 width=741) (actual time=0.472..25.724 rows=1266 loops=1)
-> Index Scan using sp_article_categories_category_id on sp_article_categories (cost=0.00..848.82 rows=656 width=12) (actual time=0.015..1.305 rows=1408 loops=1)
Index Cond: (category_id = 1081)
-> Index Scan using sp_article_pkey on sp_article (cost=0.00..7.88 rows=1 width=729) (actual time=0.014..0.015 rows=1 loops=1408)
Index Cond: (sp_article.id = sp_article_categories.article_id)
Filter: (sp_article.post_count <= 50)
Total runtime: 26.536 ms
I have an index on:
sp_article_categories.article_id (type: btree)
sp_article_categories.category_id
sp_article.post_count (type: btree)
Any suggestions on how I can tune this to get the query speedy?
Thanks!
You've provided the vital information here - the explain analyse. That isn't showing a 1 second runtime though, it's showing 20 milliseconds. So - either that isn't the query being run, or the problem is elsewhere.
The only difference between explain analyse and a real application is that the results aren't actually returned. You would need a lot of data to slow things down to 1 second though.
The other suggestions are all off the mark since they're ignoring the fact that the query isn't slow. You have the relevant indexes (both sides of the join are using an index scan) and the planner is perfectly capable of filtering on the category table first (that's the whole point of having a half decent query planner).
So - you first need to figure out what exactly is slow...
Put an index on sp_article_categories.category_id
From a pure SQL perspective, your join is more efficient if your base table has fewer rows in it, and the WHERE conditions are performed on that table before it joins to another.
So see if you can get Django to select from the categories first, then filter the category_id before joining to the article table.
Pseudo-code follows:
SELECT * FROM categories c
INNER JOIN articles a
ON c.category_id = 1081
AND c.category_id = a.category_id
And put an index on category_id like Steven suggests.
You can use field names instead * too.
select [fields] from....
I assume you have run analyze on the database to get fresh statistics.
It seems that the join between sp_article.id and sp_article_categories.article_id is costly. What data type is the article id, numeric? If it isn't you should perhaps consider making it numeric - integer or bigint, whatever suites your needs. It can make a big difference in performance according to my experience. Hope it helps.
Cheers!
// John