Slow spatial join on a single table with PostGIS - django

My goal is to determine whether a building has at least one wall shared with a building of another estate. I use a PostGIS query to do so, but it is really slow. I have tweaked it for two weeks with some success but no breakthrough.
I have two tables:
Estate (a piece of land)
CREATE TABLE IF NOT EXISTS public.front_estate
(
id integer NOT NULL DEFAULT nextval('front_estate_id_seq'::regclass),
perimeter geometry(Polygon,4326),
CONSTRAINT front_estate_pkey PRIMARY KEY (id)
);
CREATE INDEX IF NOT EXISTS front_estate_perimeter_idx
ON public.front_estate USING spgist
(perimeter);
Building
CREATE TABLE IF NOT EXISTS public.front_building
(
id integer NOT NULL DEFAULT nextval('front_building_id_seq'::regclass),
type character varying(255) COLLATE pg_catalog."default",
footprint integer,
polygon geometry(Polygon,4326),
shared_wall integer,
CONSTRAINT front_building_pkey PRIMARY KEY (id)
);
CREATE INDEX IF NOT EXISTS front_building_polygon_idx
ON public.front_building USING spgist
(polygon)
TABLESPACE pg_default;
CREATE INDEX IF NOT EXISTS front_building_type_124fcf82
ON public.front_building USING btree
(type COLLATE pg_catalog."default" ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX IF NOT EXISTS front_building_type_124fcf82_like
ON public.front_building USING btree
(type COLLATE pg_catalog."default" varchar_pattern_ops ASC NULLS LAST)
TABLESPACE pg_default;
The m2m relation:
CREATE TABLE IF NOT EXISTS public.front_estate_buildings
(
id integer NOT NULL DEFAULT nextval('front_estate_buildings_id_seq'::regclass),
estate_id integer NOT NULL,
building_id integer NOT NULL,
CONSTRAINT front_estate_buildings_pkey PRIMARY KEY (id),
CONSTRAINT front_estate_buildings_estate_id_building_id_863b3358_uniq UNIQUE (estate_id, building_id),
CONSTRAINT front_estate_buildin_building_id_fc5c4235_fk_front_bui FOREIGN KEY (building_id)
REFERENCES public.front_building (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
DEFERRABLE INITIALLY DEFERRED,
CONSTRAINT front_estate_buildings_estate_id_2c28ec2a_fk_front_estate_id FOREIGN KEY (estate_id)
REFERENCES public.front_estate (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
DEFERRABLE INITIALLY DEFERRED
);
CREATE INDEX IF NOT EXISTS front_estate_buildings_building_id_fc5c4235
ON public.front_estate_buildings USING btree
(building_id ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX IF NOT EXISTS front_estate_buildings_estate_id_2c28ec2a
ON public.front_estate_buildings USING btree
(estate_id ASC NULLS LAST)
TABLESPACE pg_default;
To have a shared wall, a building must touch another building that does not belong to the same estate.
The final data set will have around 100 million rows. Right now my development building table has 2 million rows.
Here is the query I use to get all relations between buildings and estates:
SELECT b.id as b_id, rel.estate_id as e_id, swb.id as swb_id, sw_rel.estate_id as swe_id
FROM front_building b
JOIN front_building swb ON swb.id < b.id AND ST_Intersects(b.polygon, swb.polygon)
JOIN front_estate_buildings rel ON rel.building_id = b.id
JOIN front_estate_buildings sw_rel ON sw_rel.building_id = swb.id
ORDER BY b.id ASC;
Here is the EXPLAIN ANALYZE given by pgAdmin:
1. Limit (rows=500 loops=1)
2. Nested Loop Inner Join (rows=500 loops=1)
3. Nested Loop Inner Join (rows=695 loops=1)
4. Nested Loop Inner Join (rows=2985 loops=1)
5. Index Scan using front_estate_buildings_building_id_fc5c4235 on front_estate_buildings as rel (rows=2985 loops=1)
6. Memoize (rows=1 loops=2985)
Buckets: Batches: Memory Usage: 715 kB
7. Index Scan using front_building_pkey on front_building as b (rows=1 loops=2751)
Index Cond: (id = rel.building_id)
8. Index Scan using front_building_polygon_idx on front_building as swb (rows=0 loops=2985)
Filter: ((id < b.id) AND st_intersects(b.polygon, polygon))
Index Cond: (polygon && b.polygon)
Rows Removed by Filter: 2
9. Index Scan using front_estate_buildings_building_id_fc5c4235 on front_estate_buildings as sw_rel (rows=1 loops=695)
Index Cond: (building_id = swb.id)
On my dev machine (MBP M1, 16 GB RAM) it completes in 20 minutes for 2M rows, which is not good but acceptable. On my production machine (Linode, 8 CPU cores, 16 GB RAM) the CPU goes through the roof (continuous 250% capacity) and the query never seems to end.
Do you have any clue on how to proceed? Change the query? The database structure? Use multiprocessing?
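The rule stated above (a shared wall requires touching a building from a different estate) can be applied to the rows returned by the query. The following is only a minimal sketch of that post-processing step in Python; the function name and the sample rows are hypothetical, and it assumes that any pair of differing estate ids on one row is enough to flag both buildings.

```python
from typing import Iterable, Set, Tuple

# One row per (building, estate, touching building, its estate), mirroring the
# columns b_id, e_id, swb_id, swe_id of the query above.
Row = Tuple[int, int, int, int]


def buildings_with_shared_wall(rows: Iterable[Row]) -> Set[int]:
    """Return ids of buildings that touch a building of a different estate."""
    flagged: Set[int] = set()
    for b_id, e_id, swb_id, swe_id in rows:
        if e_id != swe_id:
            # The relation is symmetric, so both buildings get flagged.
            flagged.add(b_id)
            flagged.add(swb_id)
    return flagged


# Hypothetical sample rows, not taken from the real data set.
sample_rows = [
    (2, 10, 1, 10),  # same estate: not a shared wall in the sense above
    (4, 11, 3, 12),  # different estates: both 4 and 3 are flagged
    (6, 13, 5, 13),  # same estate again
]
flagged_ids = buildings_with_shared_wall(sample_rows)
```

The same estate comparison could presumably be pushed into the SQL itself as an extra join condition, which would keep same-estate pairs out of the result set.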


Why is a left join in Redshift not working?

We are facing a weird issue with Redshift and I am looking for help debugging it. The details are as follows:
I have two tables and I am trying to perform a left join as follows:
select count(*)
from abc.orders ot
left outer join abc.events e on ot.context_id = e.context_id
where ot.order_id = '222:102'
The query above returns ~7000 records. It looks as if it is performing a default join, because we have only one record in the [Orders] table with Order ID = '222:102'.
select count(*)
from abc.orders ot
left outer join abc.events e on ot.event_id = e.event_id
where ot.order_id = '222:102'
The query above correctly returns 1 record. Note that I only changed the column used to join the two tables. Event_ID in the [Events] table is an identity column, but I expected similar results with any other column, such as Context_ID.
Further, I tried the following query, expecting it to return all ~7000 records because it uses the default join, but surprisingly it returned only 1 record.
select count(*)
from abc.orders ot
join abc.events e on ot.event_id = e.event_id
where ot.order_id = '222:102'
The Redshift database details follow.
Cut-down version of the table metadata:
CREATE TABLE abc.orders (
order_id character varying(30) NOT NULL ENCODE raw,
context_id integer ENCODE raw,
event_id character varying(21) NOT NULL ENCODE zstd,
FOREIGN KEY (event_id) REFERENCES events_20191014(event_id)
)
DISTSTYLE EVEN
SORTKEY ( context_id, order_id );
CREATE TABLE abc.events (
event_id character varying(21) NOT NULL ENCODE raw,
context_id integer ENCODE raw,
PRIMARY KEY (event_id)
)
DISTSTYLE ALL
SORTKEY ( context_id, event_id );
Database: Amazon Redshift cluster
I think I am missing something essential about joining these tables. Could you please point me in the right direction?
Thank you
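One possible explanation, assuming context_id is not unique in abc.events (the DDL above declares only event_id as a primary key): a left join returns one row per matching pair, so a single order row fans out to as many rows as there are events with the same context_id. The sketch below reproduces the effect with SQLite from Python rather than Redshift; the sample rows and the helper name count_joined are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id TEXT, context_id INTEGER, event_id TEXT);
    CREATE TABLE events (event_id TEXT PRIMARY KEY, context_id INTEGER);
    INSERT INTO orders VALUES ('222:102', 7, 'e1');
    -- Three events share context_id 7, but each event_id is unique.
    INSERT INTO events VALUES ('e1', 7), ('e2', 7), ('e3', 7), ('e4', 8);
    """
)


def count_joined(join_column: str) -> int:
    """Count rows of the left join on the given column for the one order."""
    # join_column is restricted to two known names, so formatting it is safe.
    assert join_column in ("context_id", "event_id")
    sql = (
        "SELECT count(*) FROM orders ot "
        f"LEFT OUTER JOIN events e ON ot.{join_column} = e.{join_column} "
        "WHERE ot.order_id = '222:102'"
    )
    return conn.execute(sql).fetchone()[0]


rows_by_context = count_joined("context_id")  # one row per matching event
rows_by_event = count_joined("event_id")      # event_id is unique: one row
```

If this is what happens on the real tables, counting the rows in abc.events that carry the order's context_id should give the same ~7000.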

Django PostgreSQL double index cleanup

We've got a table in our database with 80 GB of data and 230 GB of indexes. We are constrained by our disk, which is already maxed out.
What bothers me is that we have two indexes that look pretty darn similar:
CREATE INDEX tracks_trackpoint_id ON tracks_trackpoint USING btree (id)
CREATE UNIQUE INDEX tracks_trackpoint_pkey ON tracks_trackpoint USING btree (id)
I have no idea what the history behind this is, but the first one seems quite redundant. What would be the risk of dropping it? It would buy us one year of storage.
You can drop the first index, it is totally redundant.
If your tables are 80GB and your indexes 230GB, I am ready to bet that you have too many indexes in your database.
Drop the indexes that are not used.
Careful as I am, I disabled the index to benchmark this, and the query seems to fall back nicely on the other index. I'll try a few variants.
appdb=# EXPLAIN analyze SELECT * FROM tracks_trackpoint where id=266082;
Index Scan using tracks_trackpoint_id on tracks_trackpoint (cost=0.57..8.59 rows=1 width=48) (actual time=0.013..0.013 rows=0 loops=1)
Index Cond: (id = 266082)
Total runtime: 0.040 ms
(3 rows)
appdb=# UPDATE pg_index SET indisvalid = FALSE WHERE indexrelid = 'tracks_trackpoint_id'::regclass;
appdb=# EXPLAIN analyze SELECT * FROM tracks_trackpoint where id=266082;
Index Scan using tracks_trackpoint_pkey on tracks_trackpoint (cost=0.57..8.59 rows=1 width=48) (actual time=0.013..0.013 rows=0 loops=1)
Index Cond: (id = 266082)
Total runtime: 0.036 ms
(3 rows)
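The redundancy spotted above (two btree indexes on the same column of the same table) can also be looked for mechanically. What follows is a rough Python sketch, not a complete tool: it assumes the index definitions are available as text in the shape shown in the question, the function name find_redundant_indexes is hypothetical, and the simple pattern is not written to handle partial or expression indexes.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Matches definitions in the shape shown above:
#   CREATE [UNIQUE] INDEX name ON table USING method (columns)
INDEX_RE = re.compile(
    r"CREATE\s+(UNIQUE\s+)?INDEX\s+(\S+)\s+ON\s+(\S+)\s+USING\s+(\w+)\s*\((.*)\)\s*$",
    re.IGNORECASE,
)


def find_redundant_indexes(
    definitions: List[str],
) -> Dict[Tuple[str, str, str], List[str]]:
    """Group index names by (table, method, columns); keep groups of 2+."""
    groups: Dict[Tuple[str, str, str], List[str]] = defaultdict(list)
    for definition in definitions:
        match = INDEX_RE.match(definition.strip().rstrip(";"))
        if match is None:
            continue  # Skip shapes this simple pattern does not understand.
        _unique, name, table, method, columns = match.groups()
        key = (table.lower(), method.lower(), " ".join(columns.lower().split()))
        groups[key].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


definitions = [
    "CREATE INDEX tracks_trackpoint_id ON tracks_trackpoint USING btree (id)",
    "CREATE UNIQUE INDEX tracks_trackpoint_pkey ON tracks_trackpoint USING btree (id)",
]
redundant = find_redundant_indexes(definitions)
```

Within a reported group, the index backing the primary key constraint is the one to keep; the plain non-unique copy is the candidate for dropping.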

How to add a partition boundary only when not exists in SQL Data Warehouse?

I am using Azure SQL Data Warehouse Gen 1, and I created a partitioned table like this:
CREATE TABLE [dbo].[StatsPerBin1](
[Bin1] [varchar](100) NOT NULL,
[TimeWindow] [datetime] NOT NULL,
[Count] [int] NOT NULL,
[Timestamp] [datetime] NOT NULL)
WITH
(
DISTRIBUTION = HASH ( [Bin1] ),
CLUSTERED INDEX([Bin1]),
PARTITION
(
[TimeWindow] RANGE RIGHT FOR VALUES ()
)
)
How can I split a partition only when the boundary does not already exist?
My first thought was that if I could get the partition boundaries by table name, I could write an IF statement to decide whether to add a partition boundary.
But I cannot find a way to associate a table with its partition values. The values of all partitions can be retrieved with:
SELECT * FROM sys.partition_range_values
But it contains only function_id as an identifier, and I don't know how to join it to other tables to get the partition boundaries by table name.
Have you tried joining sys.partition_range_values with sys.partition_functions view?
Granted we cannot create partition functions in SQL DW, but the view seems to be still supported.
I know this is an old question, but I was having the same problem. Here is the query I ended up with, which can get you started. It is slightly modified from a query in the SQL Server documentation:
SELECT s.[name] AS [schema_name]
, t.[name] AS [table_name]
, p.[partition_number] AS [partition_number]
, rv.[value] AS [partition_boundary_value]
, p.[data_compression_desc] AS [partition_compression_desc]
FROM sys.schemas s
JOIN sys.tables t ON t.[schema_id] = s.[schema_id]
JOIN sys.partitions p ON p.[object_id] = t.[object_id]
JOIN sys.indexes i ON i.[object_id] = p.[object_id]
AND i.[index_id] = p.[index_id]
JOIN sys.data_spaces ds ON ds.[data_space_id] = i.[data_space_id]
LEFT JOIN sys.partition_schemes ps ON ps.[data_space_id] = ds.[data_space_id]
LEFT JOIN sys.partition_functions pf ON pf.[function_id] = ps.[function_id]
LEFT JOIN sys.partition_range_values rv ON rv.[function_id] = pf.[function_id]
AND rv.[boundary_id] = p.[partition_number]
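Once the existing boundaries of a table are known from a query like the one above, the "split only when missing" decision is a simple membership test. Here is a minimal client-side sketch in Python. It assumes the boundaries were already fetched and rendered as strings in the same format as the new boundary; the function name is hypothetical and the boundary values are made up for illustration.

```python
from typing import Iterable, Optional


def split_statement_if_missing(
    table: str, existing_boundaries: Iterable[str], new_boundary: str
) -> Optional[str]:
    """Return an ALTER TABLE ... SPLIT RANGE statement, or None if not needed."""
    if new_boundary in set(existing_boundaries):
        return None
    # Quotes are doubled so the boundary stays a single T-SQL string literal.
    literal = new_boundary.replace("'", "''")
    return f"ALTER TABLE {table} SPLIT RANGE ('{literal}');"


# Hypothetical boundaries, as they might come back from the query above.
existing = ["2019-01-01", "2019-02-01"]
already_there = split_statement_if_missing(
    "[dbo].[StatsPerBin1]", existing, "2019-02-01"
)
to_run = split_statement_if_missing(
    "[dbo].[StatsPerBin1]", existing, "2019-03-01"
)
```

The comparison is purely textual, so the boundary values have to be normalized to one date format before calling the function.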

What is the difference between scan and query in DynamoDB? When to use scan / query?

A query operation, as specified in the DynamoDB documentation:
A query operation searches only primary key attribute values and supports a subset of comparison operators on key attribute values to refine the search process.
and the scan operation:
A scan operation scans the entire table. You can specify filters to apply to the results to refine the values returned to you, after the complete scan.
Which is better in terms of performance and cost?
When creating a DynamoDB table, select the primary key and Local Secondary Indexes (LSIs) so that a Query operation returns the items you want.
Query operations support only an equality condition on the partition key, but allow conditional operators (=, <, <=, >, >=, BETWEEN, begins_with) on the sort key.
Scan operations are generally slower and more expensive, because the operation has to iterate through every item in your table to find the items you are requesting.
Example:
Table: CustomerId, AccountType, Country, LastPurchase
Primary Key: CustomerId + AccountType
In this example, you can use a Query operation to get:
A CustomerId with a conditional filter on AccountType
A Scan operation would need to be used to return:
All Customers with a specific AccountType
Items based on conditional filters by Country, i.e. all customers from the USA
Items based on conditional filters by LastPurchase, i.e. all customers that made a purchase in the last month
To avoid scans for frequently used operations, create a Local Secondary Index (LSI) or Global Secondary Index (GSI).
Example:
Table: CustomerId, AccountType, Country, LastPurchase
Primary Key: CustomerId + AccountType
GSI: AccountType + CustomerId
LSI: CustomerId + LastPurchase
In this example a Query operation can allow you to get:
A CustomerId with a conditional filter on AccountType
[GSI] A conditional filter on CustomerIds for a specific AccountType
[LSI] A CustomerId with a conditional filter on LastPurchase
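The access patterns listed above boil down to a lookup: a Query is possible when the attribute tested for equality is the partition key of the table or of an index, and any range condition targets that same key schema's sort key. The following Python sketch only models that decision for the example schema; it does not call DynamoDB, and choose_access_path is a hypothetical helper.

```python
from typing import Dict, Optional, Tuple

# Key schemas from the example above: name -> (partition key, sort key).
KEY_SCHEMAS: Dict[str, Tuple[str, str]] = {
    "table": ("CustomerId", "AccountType"),
    "GSI": ("AccountType", "CustomerId"),
    "LSI": ("CustomerId", "LastPurchase"),
}


def choose_access_path(equals_attr: str, range_attr: Optional[str] = None) -> str:
    """Pick the key schema that can serve a Query, else fall back to 'scan'.

    equals_attr is the attribute tested with '='; range_attr is an optional
    second attribute tested with a comparison such as '<' or BETWEEN.
    """
    for name, (partition_key, sort_key) in KEY_SCHEMAS.items():
        if equals_attr != partition_key:
            continue
        if range_attr is None or range_attr == sort_key:
            return name
    return "scan"


by_customer_and_type = choose_access_path("CustomerId", "AccountType")
by_type = choose_access_path("AccountType", "CustomerId")
by_customer_and_purchase = choose_access_path("CustomerId", "LastPurchase")
by_country = choose_access_path("Country")
```

Any access pattern that ends up as "scan" and is used often is a candidate for a new index.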
Suppose your DynamoDB table has customer_country as its partition key. If you use Query, customer_country is a mandatory field of the operation, and any filters apply only to items that belong to that customer_country value.
If you perform a table Scan, the filter is applied across all partition key values: it first fetches all the data and applies the filter after fetching it from the table.
E.g.:
Here customer_country is the partition key
and id is the sort key
-----------------------------------
customer_country | name | id
-----------------------------------
VV | Tom | 1
VV | Jack | 2
VV | Mary | 4
BB | Nancy | 5
BB | Lom | 6
BB | XX | 7
CC | YY | 8
CC | ZZ | 9
------------------------------------
If you perform a Query operation, it applies to a single customer_country value.
The condition on that value must use the equality operator (=).
So only items whose partition key equals that value are fetched.
If you perform a Scan operation, it fetches all items in the table and filters the data afterwards.
Note: Avoid Scan operations where you can; they can exhaust your read capacity units (RCUs).
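The difference described above can be made concrete by counting how many items each operation has to read. This is a toy in-memory model in Python using the sample table from this answer, not the DynamoDB API; the query and scan functions are stand-ins written for illustration.

```python
from typing import Dict, List, Tuple

Item = Dict[str, object]

# The sample table above, grouped by partition key as a rough stand-in for
# how items sharing a partition key are stored together.
TABLE: Dict[str, List[Item]] = {
    "VV": [{"name": "Tom", "id": 1}, {"name": "Jack", "id": 2}, {"name": "Mary", "id": 4}],
    "BB": [{"name": "Nancy", "id": 5}, {"name": "Lom", "id": 6}, {"name": "XX", "id": 7}],
    "CC": [{"name": "YY", "id": 8}, {"name": "ZZ", "id": 9}],
}


def query(country: str) -> Tuple[List[Item], int]:
    """Read only the items of one partition; return (items, items_read)."""
    items = TABLE.get(country, [])
    return list(items), len(items)


def scan(country: str) -> Tuple[List[Item], int]:
    """Read every item, then filter; return (items, items_read)."""
    items_read = 0
    matches: List[Item] = []
    for partition_key, items in TABLE.items():
        for item in items:
            items_read += 1
            if partition_key == country:
                matches.append(item)
    return matches, items_read


query_items, query_reads = query("BB")
scan_items, scan_reads = scan("BB")
```

Both calls return the same three items, but the scan reads all eight items of the table to find them, which is where the extra read capacity goes.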
It is similar to a relational database.
With Query you use a primary key in the WHERE condition. The computational complexity is O(log n), as most key structures are binary trees.
With Scan you have to read the whole table and then apply the filter to every single row to find the right result. The complexity is O(n), which is much slower if your table is big.
In short, try to use Query if you know the primary key, and Scan only as a last resort.
Also, consider a global secondary index to support different kinds of queries on different keys and meet your performance goals.
In terms of performance, I think it's good practice to design your table so that applications can use Query instead of Scan, because a Scan operation always scans the entire table before it filters out the desired values, which means it takes more time and space to process data operations such as read, write and delete. For more information, please refer to the official documentation.
Query is much better than Scan, performance-wise. Scan, as its name implies, scans the whole table. But you must know the table's key, sort key, indexes and their related sort keys well in order to know when you can use Query.
If you filter your query using:
key
key & sort key
index
index and its related sort key
use Query! Otherwise use Scan, which is more flexible about which columns you can filter on.
You can NOT Query if you filter on:
more than 2 fields (e.g. key, sort and index)
sort key only (of the primary key or an index)
regular fields (not key, index or sort)
mixed index and sort (index1 with the sort key of index2)
...
A good explanation:
https://medium.com/#amos.shahar/dynamodb-query-vs-scan-sql-syntax-and-join-tables-part-1-371288a7cb8f

Having trouble summing columns in SQL Server joined view

Moi guys, Matt here. I'm having trouble with a relatively complicated view. I have a parts and service table that each have unique identifiers for a given part/service. I'm trying to link these to a service invoice table and subsequent view as a M:N relationship, so I've set up intermediary relational tables, with both the invoice number (invoice primary key) and part/service number (part/service primary key) as the combined primary key. Here's my code for the whole relationship and view:
CREATE TABLE service_invoice
( servinv_Num VARCHAR2(10) CONSTRAINT serv_snum_PK PRIMARY KEY,
servinv_EmpID NUMBER(6) CONSTRAINT serv_empnum_FK REFERENCES employee(empID),
servinv_CustID NUMBER(6) CONSTRAINT serv_custid_FK REFERENCES customer(custID),
servinv_VIN VARCHAR2(25) CONSTRAINT serv_VIN_FK REFERENCES vehicle(vehicle_vin),
servinv_Terms VARCHAR2(6) CONSTRAINT serv_trms_NN NOT NULL,
servinv_Date DATE );
CREATE TABLE Parts
( PartID VARCHAR2(10) CONSTRAINT Part_PartID_PK PRIMARY KEY,
PartDesc VARCHAR2(50) CONSTRAINT Part_PartDesc_NN NOT NULL,
PartCharge NUMBER(4,2) CONSTRAINT Part_PartCharge_NN NOT NULL );
CREATE TABLE Service
( ServiceID VARCHAR2(10) CONSTRAINT Serv_ServID_PK PRIMARY KEY,
ServDesc VARCHAR2(50) CONSTRAINT Serv_ServName_NN NOT NULL,
ServCharge NUMBER(4,2) CONSTRAINT Serv_ServCharge_NN NOT NULL );
CREATE TABLE Serv_SI_Rel
( SI_num VARCHAR2(10) CONSTRAINT ServSI_SInum_FK REFERENCES service_invoice(servinv_Num),
ServiceID VARCHAR2(10) CONSTRAINT ServSI_ServID_FK REFERENCES Service(ServiceID),
CONSTRAINT ServSI_SInum_ServID_PK PRIMARY KEY(SI_num, ServiceID) );
CREATE TABLE Parts_SI_Rel
( SI_num VARCHAR2(10) CONSTRAINT PartSI_SInum_FK REFERENCES service_invoice(servinv_Num),
PartID VARCHAR2(10) CONSTRAINT PartSI_PartID_FK REFERENCES Parts(PartID),
CONSTRAINT PartSI_SInum_PartID_PK PRIMARY KEY(SI_num, PartID) );
CREATE OR REPLACE VIEW ServiceInvoiceDoc
AS
(
SELECT si.servinv_Num, si.servinv_Date, si.servinv_Terms,
es.empName,
sc.custName, sc.custHouse, sc.custCity,
sc.custState, sc.custZIP, sc.custPhone, sc.custEmail,
sv.vehicle_VIN, sv.vehicle_mileage,
srel.ServiceID,
prel.PartID,
s.ServDesc, s.ServCharge,
p.PartDesc, p.PartCharge,
SUM(s.ServCharge) TotalServCharges,
SUM(p.PartCharge) TotalPartsCharges,
( SUM(s.ServCharge)+SUM(p.PartCharge) ) SubTotalCharges,
( SUM(s.ServCharge)+SUM(p.PartCharge) )*0.0825 Taxes,
( SUM(s.ServCharge)+SUM(p.PartCharge) )*1.0825 TotalCharges
FROM service_invoice si
JOIN employee es
ON (es.empID = si.servinv_EmpID)
JOIN customer sc
ON (sc.custID = si.servinv_CustID)
JOIN vehicle sv
ON (sv.vehicle_VIN = si.servinv_VIN)
LEFT OUTER JOIN Serv_SI_Rel srel
ON (srel.SI_Num = si.servinv_Num)
LEFT OUTER JOIN Parts_SI_Rel prel
ON (prel.SI_num = si.servinv_Num)
JOIN Parts p
ON (prel.PartID = p.PartID)
JOIN Service s
ON (srel.ServiceID = s.ServiceID) );
The error I get has to do with summing the individual parts and service charges in the M:N relationship. Here's the error code from the run:
ORA-00937: not a single-group group function
I've tried fixing with a group by command, but the grouping identifier (service invoice) isn't included on the part or service tables, and the joins don't seem to link these up for a group. e.g. I tried calling GROUP BY si.servinv_Num
Can this be resolved at all or is it completely wrong? I have the option of dropping the M:N relationship as a 1:M and simply making a separate invoice for each part/service charge, but I would prefer to keep it compact and professional.
Any help would be greatly appreciated. Thank you so much for your time!
a) Wrong tag: the ORA- error code comes from Oracle, not SQL Server.
b) I'd imagine you need to list every column that isn't aggregated in the GROUP BY clause, as Oracle requires:
...
group by si.servinv_Num, si.servinv_Date, si.servinv_Terms,
es.empName,
sc.custName, sc.custHouse, sc.custCity,
sc.custState, sc.custZIP, sc.custPhone, sc.custEmail,
sv.vehicle_VIN, sv.vehicle_mileage,
srel.ServiceID,
prel.PartID,
s.ServDesc, s.ServCharge,
p.PartDesc, p.PartCharge
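A separate point to watch in the view above: joining both the parts and the services of an invoice in one query pairs every service row with every part row, so summing over that join counts each charge several times. One common way around this is to total each detail table per invoice before joining. The sketch below demonstrates the effect with SQLite from Python rather than Oracle; the two simplified tables and their rows are hypothetical stand-ins for the Serv_SI_Rel/Service and Parts_SI_Rel/Parts pairs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE invoice_services (si_num TEXT, serv_charge INTEGER);
    CREATE TABLE invoice_parts (si_num TEXT, part_charge INTEGER);
    -- Hypothetical invoice with two services and three parts.
    INSERT INTO invoice_services VALUES ('SI1', 10), ('SI1', 20);
    INSERT INTO invoice_parts VALUES ('SI1', 1), ('SI1', 2), ('SI1', 3);
    """
)

# Joining both detail tables directly pairs every service with every part,
# so each charge is counted several times.
naive = conn.execute(
    """
    SELECT SUM(s.serv_charge), SUM(p.part_charge)
    FROM invoice_services s
    JOIN invoice_parts p ON p.si_num = s.si_num
    GROUP BY s.si_num
    """
).fetchone()

# Summing each detail table per invoice first gives one row per invoice on
# both sides, so the join no longer multiplies rows.
pre_aggregated = conn.execute(
    """
    SELECT s.total_serv, p.total_parts
    FROM (SELECT si_num, SUM(serv_charge) AS total_serv
          FROM invoice_services GROUP BY si_num) s
    JOIN (SELECT si_num, SUM(part_charge) AS total_parts
          FROM invoice_parts GROUP BY si_num) p ON p.si_num = s.si_num
    """
).fetchone()
```

The real service total here is 30 and the real parts total is 6; the naive join inflates them because the two services are repeated for each of the three parts and vice versa.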