Optimize heavy BigQuery DELETE query - google-cloud-platform

The following BigQuery DELETE query fails with a timeout because it reaches the 6-hour execution-time limit:
DELETE FROM animals A WHERE EXISTS
(SELECT id FROM pets P WHERE A.id = P.id)
Table animals has ~50,000,000,000 records.
Table pets has ~300,000 records.
Tables are not partitioned.
Edit:
It seems this query does not give any improvement either:
DELETE animals WHERE id IN
(SELECT id FROM pets)

SELECT id FROM (
  SELECT id, tbl, DENSE_RANK() OVER (PARTITION BY id ORDER BY tbl) AS rk FROM (
    SELECT id, 1 AS tbl FROM animals
    UNION ALL
    SELECT id, 0 AS tbl FROM pets
  )
) WHERE rk = 1 AND tbl = 1;
This query will give you all the ids from animals that do not exist in pets.
If id is unique in animals, you can use ROW_NUMBER() instead of DENSE_RANK().
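Since the DELETE itself is what hits the 6-hour limit, one workaround (a sketch, not from the original answer; it assumes animals can safely be rewritten in place and has no partitioning, clustering, or table options that must be preserved) is to materialize only the surviving rows back into the table with an anti-join:
-- Hypothetical rewrite: keep only the animals whose id is absent from pets.
CREATE OR REPLACE TABLE animals AS
SELECT a.*
FROM animals a
LEFT JOIN pets p ON a.id = p.id
WHERE p.id IS NULL;
This turns the operation into a single scan-and-write pass instead of a long-running DELETE, at the cost of rewriting the whole table.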

Related

Getting table names and row counts for all tables in an Athena database

I have an AWS database with multiple tables that I am trying to get the row counts for in a single query.
The ideal query output would be:
table_name row_count
table2_name row_count
etc...
So far I've been able to either get all the table names from the database or all the row counts of the tables (in random order), but not both in the same query.
This query returns a column of all the table names that exist in the database:
SELECT table_name FROM information_schema.tables WHERE table_schema = '<database_name>';
This query returns all the row counts for the tables:
SELECT COUNT(*) FROM table_name
UNION ALL
SELECT COUNT(*) FROM table2_name
UNION ALL
etc..for the rest of the tables
The issue with this query is that it displays the row counts in a random order that doesn't correspond to the order of the tables in the query, so I don't know which row count goes with which table - hence why I need both the table names and row counts in the same output.
Simply add the names of the tables as literals in your queries:
SELECT 'table_name' AS table_name, COUNT(*) AS row_count FROM table_name
UNION ALL
SELECT 'table_name2' AS table_name, COUNT(*) AS row_count FROM table_name2
UNION ALL
…
The following query generates the UNION query that produces counts of all records.
One problem to solve is that (as of December 2022) INFORMATION_SCHEMA.TABLES incorrectly reports every table and view as a BASE TABLE, so you will need some logic to eliminate the views.
In data warehousing it is common practice to record snapshots of the record counts of landing tables at frequent intervals. Any unexpected deviations from expected counts can be used for reporting/alerting.
WITH Table_List AS (
    SELECT table_schema, table_name,
           CONCAT('SELECT CURRENT_DATE AS run_date, ''', table_name,
                  ''' AS table_name, COUNT(*) AS Records FROM "',
                  table_schema, '"."', table_name, '"') AS BaseSQL
    FROM INFORMATION_SCHEMA.TABLES
    WHERE table_schema = 'YOUR_DB_NAME'          -- Change this
      AND table_name LIKE 'YOUR TABLE PATTERN%'  -- Change or remove this line
),
Total_Records AS (
    SELECT COUNT(*) AS Table_Count
    FROM Table_List
)
SELECT
    CASE WHEN ROW_NUMBER() OVER (ORDER BY table_name) = Table_Count
         THEN BaseSQL
         ELSE CONCAT(BaseSQL, ' UNION ALL') END AS All_Table_Record_count_SQL
FROM Table_List CROSS JOIN Total_Records
ORDER BY table_name;
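For example, against a hypothetical database mydb containing tables t1 and t2, the generator would emit something like:
SELECT CURRENT_DATE AS run_date, 't1' AS table_name, COUNT(*) AS Records FROM "mydb"."t1" UNION ALL
SELECT CURRENT_DATE AS run_date, 't2' AS table_name, COUNT(*) AS Records FROM "mydb"."t2"
Paste the generated statements into a new query and run them to get one labelled count row per table.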

Redshift Auto Vacuum

I have recently implemented distribution styles and sort keys on a few Redshift tables after doing the following analysis:
To find the best candidates for distribution styles and sort keys, I looked into the queries that run on these tables, then shortlisted one column for the sort key and one for the distribution key (instead of a distribution style) on the basis of the column's cardinality (the degree of uniqueness of its data). I also ran EXPLAIN, which clearly showed a reduction in cost, width, and rows scanned.
On the basis of the above, I applied the sort key and dist key to the few tables where they produced the best performance.
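(For reference, applying such a change looks like the following sketch; the table and column names here are hypothetical:)
-- Assign a distribution key and a compound sort key to an existing table.
ALTER TABLE my_events ALTER DISTKEY device_id;
ALTER TABLE my_events ALTER SORTKEY (event_time);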
Now, after implementation, when I looked into pct_unsorted for these tables, these are my observations:
Not all tables were auto-vacuumed by Redshift. Some had [BG] Vacuum run on them, whereas others only had [BG] Vacuum Delete. These were the tables with 100% unsorted data.
Upon looking into the vacuum status from stl_vacuum:
My questions:
On what grounds does Redshift decide which kind of vacuum is run (auto vacuum)?
Why is Redshift running VACUUM for some tables, and VACUUM DELETE ONLY for others?
What is the ideal or best practice for vacuuming tables to get the data sorted?
Thanks in advance.
The queries run for the results shown above are:
select trim(pgn.nspname) as schema,
trim(a.name) as table, id as tableid,
decode(pgc.reldiststyle,0, 'even',1,det.distkey ,8,'all') as distkey, dist_ratio.ratio::decimal(10,4) as skew,
det.head_sort as "sortkey",
det.n_sortkeys as "#sks", b.mbytes,
decode(b.mbytes,0,0,((b.mbytes/part.total::decimal)*100)::decimal(5,2)) as pct_of_total,
decode(det.max_enc,0,'n','y') as enc, a.rows,
decode( det.n_sortkeys, 0, null, a.unsorted_rows ) as unsorted_rows ,
decode( det.n_sortkeys, 0, null, decode( a.rows,0,0, (a.unsorted_rows::decimal(32)/a.rows)*100) )::decimal(5,2) as pct_unsorted
from (select db_id, id, name, sum(rows) as rows,
sum(rows)-sum(sorted_rows) as unsorted_rows
from stv_tbl_perm a
group by db_id, id, name) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
left outer join (select tbl, count(*) as mbytes
from stv_blocklist group by tbl) b on a.id=b.tbl
inner join (select attrelid,
min(case attisdistkey when 't' then attname else null end) as "distkey",
min(case attsortkeyord when 1 then attname else null end ) as head_sort ,
max(attsortkeyord) as n_sortkeys,
max(attencodingtype) as max_enc
from pg_attribute group by 1) as det
on det.attrelid = a.id
inner join ( select tbl, max(mbytes)::decimal(32)/min(mbytes) as ratio
from (select tbl, trim(name) as name, slice, count(*) as mbytes
from svv_diskusage group by tbl, name, slice )
group by tbl, name ) as dist_ratio on a.id = dist_ratio.tbl
join ( select sum(capacity) as total
from stv_partitions where part_begin=0 ) as part on 1=1
where mbytes is not null
and "table" in (<TABLE NAMES>)
order by mbytes desc
&
select xid, a.table_id,b."table", status, rows, sortedrows, blocks, eventtime
from stl_vacuum a
inner join (select table_id,"table" from svv_table_info where "schema" in ('touchpoint','activehealth')
and "table" in (<TABLE NAMES>)
) b on a.table_id = b.table_id
UPDATE:
One interesting observation: in svv_table_info, for these tables with 100% unsorted data, the vacuum_sort_benefit is 0.
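(For anyone checking the same thing, a minimal query for that observation; the schema filter is hypothetical:)
SELECT "table", unsorted, vacuum_sort_benefit
FROM svv_table_info
WHERE "schema" = 'myschema'  -- change to your schema
ORDER BY unsorted DESC;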

Update with subquery successful but not updating

We have wrong duplicate ids loaded in the table and we need to correct them.
The rule for updating the id: whenever there is a time difference of more than 30 minutes, the id should be new/unique.
I have written a query to filter those out; however, the update is not happening.
The query below finds the ids to be updated.
For testing I have used a particular id.
select id,
BEFORE_TIME,
TIMESTAMP,
datediff(minute,BEFORE_TIME,TIMESTAMP) time_diff,
row_number() over (PARTITION BY id ORDER BY TIMESTAMP) rowno,
concat(id,to_varchar(rowno)) newid from
(SELECT id,
TIMESTAMP,
LAG(TIMESTAMP_EST) OVER (PARTITION BY visit_id ORDER BY TIMESTAMP) as BEFORE_TIME
FROM table_name t
where id = 'XX1X2375'
order by TIMESTAMP_EST)
where BEFORE_TIME is not NULL and time_diff > 30
order by time_diff desc
;
And I could see 12 records with the same id and a time difference of more than 30 minutes.
However, when I try to update, the query succeeds but nothing gets updated.
update table_name t
set t.id = c.newid
from
(select id ,
BEFORE_TIME,
TIMESTAMP,
datediff(minute,BEFORE_TIME,TIMESTAMP) time_diff,
row_number() over (PARTITION BY id ORDER BY TIMESTAMP) rowno,
concat(id,to_varchar(rowno)) newid from
(SELECT id,
TIMESTAMP,
LAG(TIMESTAMP) OVER (PARTITION BY visit_id ORDER BY TIMESTAMP) as BEFORE_TIME
FROM table_name t
where id = 'XX1X2375'
order by TIMESTAMP_EST)
where BEFORE_TIME is not NULL and time_diff > 30
order by time_diff desc) c
where t.id = c.id
and t.timestamp = c.BEFORE_TIME
;
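(An observation on re-reading the UPDATE, not from the original post: the outer filter matches t.timestamp = c.BEFORE_TIME, which is the previous row's timestamp, while newid was computed for the row whose own TIMESTAMP passed the 30-minute test. If the intent is to update that row, only the final predicate would change; a sketch:)
where t.id = c.id
and t.timestamp = c.TIMESTAMP;  -- match the row that produced newid, not its predecessor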

Amazon Athena LEFT OUTER JOIN query not working as expected

I am trying to do a left outer join in Athena and my query looks like the following:
SELECT customer.name, orders.price
FROM customer LEFT OUTER JOIN orders
ON customer.id = orders.customer_id
WHERE price IS NULL;
Each customer can have at most one order in the orders table, and there are customers with no order at all. So I am expecting some records where a customer exists in the customer table with no record in the orders table, meaning that after the LEFT OUTER JOIN the price will be NULL. But this query returns 0 rows every time I run it. I have queried both tables separately and am pretty sure there is data in both, but I'm not sure why this returns zero rows, since it works if I remove price IS NULL. I have also tried price = '' and price IN (''), and none of them works. Has anyone here had a similar experience before? Or is there something wrong with my query that I cannot see or identify?
It seems that your query is correct. To validate, I created two CTEs that should match up with your customer and orders tables and ran your query against them. When running the query below, it returns a record for customer 3, Ted Johnson, who does not have an order.
WITH customer AS (
SELECT 1 AS id, 'John Doe' AS name
UNION
SELECT 2 AS id, 'Jane Smith' AS name
UNION
SELECT 3 AS id, 'Ted Johnson' AS name
),
orders AS (
SELECT 1 AS customer_id, 20 AS price
UNION
SELECT 2 AS customer_id, 15 AS price
)
SELECT customer.name, orders.price
FROM customer LEFT OUTER JOIN orders
ON customer.id = orders.customer_id
WHERE price IS NULL;
I'd suggest running the following queries:
SELECT COUNT(DISTINCT id) FROM customer;
SELECT COUNT(DISTINCT customer_id) FROM orders;
Based on the results you are seeing, I would expect those counts to match. Perhaps your system creates a record in the orders table, with a price of 0, whenever a customer is created.
Probably you can't use WHERE to filter the orders table in this case; try moving the condition into the join instead:
SELECT customer.name, orders.price
FROM customer LEFT OUTER JOIN orders
ON customer.id = orders.customer_id AND orders.price IS NULL;
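(A side note beyond the two answers above: the usual way to express "customers with no order", which sidesteps NULL handling entirely, is an anti-join; same hypothetical tables as in the CTE example:)
SELECT c.name
FROM customer c
WHERE NOT EXISTS (
    SELECT 1 FROM orders o WHERE o.customer_id = c.id
);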

How to find size of database, schema, table in Redshift

Team,
my redshift version is:
PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.735
How do I find out the database size, tablespace, schema size & table size?
The queries below do not work in Redshift (for the above version):
SELECT pg_database_size('db_name');
SELECT pg_size_pretty( pg_relation_size('table_name') );
Is there any alternative way to find this out, like Oracle's DBA_SEGMENTS?
For table size, I have the query below, but I'm not sure about the exact meaning of MBYTES. For the 3rd row, MBYTES = 372. Does that mean 372 MB?
select trim(pgdb.datname) as Database, trim(pgn.nspname) as Schema,
trim(a.name) as Table, b.mbytes, a.rows
from ( select db_id, id, name, sum(rows) as rows from stv_tbl_perm a group by db_id, id, name ) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes
from stv_blocklist group by tbl) b on a.id=b.tbl
order by a.db_id, a.name;
database | schema | table | mbytes | rows
---------------+--------------+------------------+--------+----------
postgres | public | company | 8 | 1
postgres | public | table_data1_1 | 7 | 1
postgres | proj_schema1 | table_data1 | 372 | 33867540
postgres | public | table_data1_2 | 40 | 2000001
(4 rows)
The above answers don't always give correct answers for table space used. AWS support have given this query to use:
SELECT TRIM(pgdb.datname) AS Database,
TRIM(a.name) AS Table,
((b.mbytes/part.total::decimal)*100)::decimal(5,2) AS pct_of_total,
b.mbytes,
b.unsorted_mbytes
FROM stv_tbl_perm a
JOIN pg_database AS pgdb
ON pgdb.oid = a.db_id
JOIN ( SELECT tbl,
SUM( DECODE(unsorted, 1, 1, 0)) AS unsorted_mbytes,
COUNT(*) AS mbytes
FROM stv_blocklist
GROUP BY tbl ) AS b
ON a.id = b.tbl
JOIN ( SELECT SUM(capacity) AS total
FROM stv_partitions
WHERE part_begin = 0 ) AS part
ON 1 = 1
WHERE a.slice = 0
ORDER BY 4 desc, db_id, name;
Yes, mbytes in your example is 372 MB. Here's what I've been using:
select
cast(use2.usename as varchar(50)) as owner,
pgc.oid,
trim(pgdb.datname) as Database,
trim(pgn.nspname) as Schema,
trim(a.name) as Table,
b.mbytes,
a.rows
from
(select db_id, id, name, sum(rows) as rows
from stv_tbl_perm a
group by db_id, id, name
) as a
join pg_class as pgc on pgc.oid = a.id
left join pg_user use2 on (pgc.relowner = use2.usesysid)
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
and pgn.nspowner > 1
join pg_database as pgdb on pgdb.oid = a.db_id
join
(select tbl, count(*) as mbytes
from stv_blocklist
group by tbl
) b on a.id = b.tbl
order by mbytes desc, a.db_id, a.name;
I'm not sure about grouping by database and scheme, but here's a short way to get usage by table,
SELECT tbl, name, size_mb FROM
(
  SELECT tbl, count(*) AS size_mb
  FROM stv_blocklist
  GROUP BY tbl
) b
LEFT JOIN
(select distinct id, name FROM stv_tbl_perm) p
ON p.id = b.tbl
ORDER BY size_mb DESC
LIMIT 10;
You can check out this repository; I'm sure you'll find useful stuff there.
https://github.com/awslabs/amazon-redshift-utils
To answer your question, you can use this view:
https://github.com/awslabs/amazon-redshift-utils/blob/master/src/AdminViews/v_space_used_per_tbl.sql
and then query it as you like, e.g.:
select * from admin.v_space_used_per_tbl;
Modified versions of one of the other answers. These include the database name, schema name, table name, total row count, size on disk, and unsorted size:
-- sort by row count
select trim(pgdb.datname) as Database, trim(pgns.nspname) as Schema, trim(a.name) as Table,
c.rows, ((b.mbytes/part.total::decimal)*100)::decimal(5,3) as pct_of_total, b.mbytes, b.unsorted_mbytes
from stv_tbl_perm a
join pg_class as pgtbl on pgtbl.oid = a.id
join pg_namespace as pgns on pgns.oid = pgtbl.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, sum(decode(unsorted, 1, 1, 0)) as unsorted_mbytes, count(*) as mbytes from stv_blocklist group by tbl) b on a.id=b.tbl
join (select id, sum(rows) as rows from stv_tbl_perm group by id) c on a.id=c.id
join (select sum(capacity) as total from stv_partitions where part_begin=0) as part on 1=1
where a.slice=0
order by 4 desc, db_id, name;
-- sort by space used
select trim(pgdb.datname) as Database, trim(pgns.nspname) as Schema, trim(a.name) as Table,
c.rows, ((b.mbytes/part.total::decimal)*100)::decimal(5,3) as pct_of_total, b.mbytes, b.unsorted_mbytes
from stv_tbl_perm a
join pg_class as pgtbl on pgtbl.oid = a.id
join pg_namespace as pgns on pgns.oid = pgtbl.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, sum(decode(unsorted, 1, 1, 0)) as unsorted_mbytes, count(*) as mbytes from stv_blocklist group by tbl) b on a.id=b.tbl
join (select id, sum(rows) as rows from stv_tbl_perm group by id) c on a.id=c.id
join (select sum(capacity) as total from stv_partitions where part_begin=0) as part on 1=1
where a.slice=0
order by 6 desc, db_id, name;
This query is much easier:
-- List the Top 30 largest tables on your cluster
SELECT
"schema"
,"table" AS table_name
,ROUND((size/1024.0),2) AS "Size in Gigabytes"
,pct_used AS "Physical Disk Used by This Table"
FROM svv_table_info
ORDER BY pct_used DESC
LIMIT 30;
SVV_TABLE_INFO is a Redshift system table that shows information about user-defined tables (not other system tables) in a Redshift database. The table is only visible to superusers.
To get the size of each table, run the following command on your Redshift cluster:
SELECT "table", size, tbl_rows
FROM SVV_TABLE_INFO
The table column is the table name.
The size column is the size of the table in MB.
The tbl_rows column is the total number of rows in the table, including rows that have been marked for deletion but not yet vacuumed.
Look at the SVV_TABLE_INFO Redshift documentation for other interesting columns to retrieve from this system table.
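Since the question also asks for schema size, the same view aggregates up nicely (a sketch; size is reported in 1 MB blocks, so the sum approximates megabytes per schema):
SELECT "schema",
       SUM(size) AS total_mb,
       SUM(tbl_rows) AS total_rows
FROM svv_table_info
GROUP BY "schema"
ORDER BY total_mb DESC;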
This is what I am using (please change the database name from 'mydb' to your database name):
SELECT CAST(use2.usename AS VARCHAR(50)) AS OWNER
,TRIM(pgdb.datname) AS DATABASE
,TRIM(pgn.nspname) AS SCHEMA
,TRIM(a.NAME) AS TABLE
,(b.mbytes) / 1024 AS Gigabytes
,a.ROWS
FROM (
SELECT db_id
,id
,NAME
,SUM(ROWS) AS ROWS
FROM stv_tbl_perm a
GROUP BY db_id
,id
,NAME
) AS a
JOIN pg_class AS pgc ON pgc.oid = a.id
LEFT JOIN pg_user use2 ON (pgc.relowner = use2.usesysid)
JOIN pg_namespace AS pgn ON pgn.oid = pgc.relnamespace
AND pgn.nspowner > 1
JOIN pg_database AS pgdb ON pgdb.oid = a.db_id
JOIN (
SELECT tbl
,COUNT(*) AS mbytes
FROM stv_blocklist
GROUP BY tbl
) b ON a.id = b.tbl
WHERE pgdb.datname = 'mydb'
ORDER BY mbytes DESC
,a.db_id
,a.NAME;
src: https://aboutdatabases.wordpress.com/2015/01/24/amazon-redshift-how-to-get-the-sizes-of-all-tables/