Amazon Redshift: How to find database size

There are many Google results that answer this, but none of them worked for me. I am therefore posting this question and answering it myself, for my own future reference and for anyone else who reaches this thread via Google.

Here is the Query:
select sum(mbytes)/1024, database from (
select trim(pgdb.datname) as Database,
trim(a.name) as Table, b.mbytes
from stv_tbl_perm a
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes
from stv_blocklist group by tbl) b on a.id=b.tbl
where a.slice=0
order by db_id, name)
group by database;
Output:
?column? | database
----------+---------------
62 | db1
33 | db2
33 | db3
2 | db4
37 | db5
34 | db6
35 | db7
59 | db8
2 | db9
26 | db10
2 | db11
72 | db12
36 | db13
41 | db14
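Note: The numbers above are in GB. Each row in stv_blocklist is one 1 MB block, so the block count is the size in MB, and dividing it by 1024 gives GB.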
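The arithmetic behind the query can be sketched in Python, assuming each row of stv_blocklist is one 1 MB block. The function name and the sample block counts are hypothetical and chosen only for illustration. The whole-number output above is consistent with integer division, which the sketch reproduces alongside the exact value.

```python
BLOCK_SIZE_MB = 1  # assumption: one stv_blocklist row is one 1 MB block


def blocks_to_gb(block_count):
    """Return (whole GB as integer division gives, exact GB as a float)."""
    size_mb = block_count * BLOCK_SIZE_MB
    return size_mb // 1024, size_mb / 1024


# Hypothetical block counts, not taken from a real cluster.
for blocks in (372, 2048, 64000):
    whole_gb, exact_gb = blocks_to_gb(blocks)
    print(f"{blocks} blocks: {whole_gb} GB (integer division), {exact_gb:.2f} GB (exact)")
```

Anything under 1024 blocks shows as 0 GB with integer division, so dividing by 1024.0 in the SQL may be preferable for small databases.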

When looking for disk space usage for tables, use this query:
select
trim(pgdb.datname) as Database,
trim(pgn.nspname) as Schema,
trim(a.name) as Table,
b.mbytes,
a.rows
from (
select db_id, id, name, sum(rows) as rows
from stv_tbl_perm a
group by db_id, id, name
) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (
select tbl, count(*) as mbytes
from stv_blocklist
group by tbl
) b on a.id = b.tbl
order by mbytes desc, a.db_id, a.name;

Related

Redshift Auto Vacuum

I recently applied distribution keys and sort keys to a few Redshift tables after the following analysis.
To find the best candidates, I looked at the queries that run on these tables and shortlisted one column each for the sort key and the distribution key (a key rather than a style), based on the cardinality of the column (the degree of uniqueness of its data). I also ran EXPLAIN on the queries, which clearly showed a reduction in cost, width and rows scanned.
On that basis, I applied the sort key and dist key that gave the best performance to a few of the tables.
After the implementation, I looked at pct_unsorted for these tables, and these are my observations:
Not all tables were auto-vacuumed by Redshift. Some had a [BG] Vacuum run on them, whereas others only had a [BG] Vacuum Delete run on them. These were the tables with 100% unsorted data:
Looking at the vacuum status in stl_vacuum:
My questions:
On what grounds does Redshift decide which kind of vacuum to run automatically (auto vacuum)?
Why is Redshift running VACUUM on some tables and VACUUM DELETE ONLY on others?
What is the best practice for vacuuming tables to get the data sorted?
Thanks in advance.
The queries run for the results shown above are:
select trim(pgn.nspname) as schema,
trim(a.name) as table, id as tableid,
decode(pgc.reldiststyle,0, 'even',1,det.distkey ,8,'all') as distkey, dist_ratio.ratio::decimal(10,4) as skew,
det.head_sort as "sortkey",
det.n_sortkeys as "#sks", b.mbytes,
decode(b.mbytes,0,0,((b.mbytes/part.total::decimal)*100)::decimal(5,2)) as pct_of_total,
decode(det.max_enc,0,'n','y') as enc, a.rows,
decode( det.n_sortkeys, 0, null, a.unsorted_rows ) as unsorted_rows ,
decode( det.n_sortkeys, 0, null, decode( a.rows,0,0, (a.unsorted_rows::decimal(32)/a.rows)*100) )::decimal(5,2) as pct_unsorted
from (select db_id, id, name, sum(rows) as rows,
sum(rows)-sum(sorted_rows) as unsorted_rows
from stv_tbl_perm a
group by db_id, id, name) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
left outer join (select tbl, count(*) as mbytes
from stv_blocklist group by tbl) b on a.id=b.tbl
inner join (select attrelid,
min(case attisdistkey when 't' then attname else null end) as "distkey",
min(case attsortkeyord when 1 then attname else null end ) as head_sort ,
max(attsortkeyord) as n_sortkeys,
max(attencodingtype) as max_enc
from pg_attribute group by 1) as det
on det.attrelid = a.id
inner join ( select tbl, max(mbytes)::decimal(32)/min(mbytes) as ratio
from (select tbl, trim(name) as name, slice, count(*) as mbytes
from svv_diskusage group by tbl, name, slice )
group by tbl, name ) as dist_ratio on a.id = dist_ratio.tbl
join ( select sum(capacity) as total
from stv_partitions where part_begin=0 ) as part on 1=1
where mbytes is not null
and "table" in (<TABLE NAMES>)
order by mbytes desc
and
select xid, a.table_id,b."table", status, rows, sortedrows, blocks, eventtime
from stl_vacuum a
inner join (select table_id,"table" from svv_table_info where "schema" in ('touchpoint','activehealth')
and "table" in (<TABLE NAMES>)
) b on a.table_id = b.table_id
UPDATE:
One interesting observation: in svv_table_info,
vacuum_sort_benefit is 0 for these tables with 100% unsorted data.

PowerBI DAX – subquery to the same table

I have a table, let's call it Products, with these columns:
Id
ProductId
Version
some other columns…
Id column is the primary key, and ProductId groups rows. Now I want to view distinct values of ProductId where Version is highest.
For example, from this data set:
Id | ProductId | Version | ...
100 | 1 | 0 | ...
101 | 2 | 0 | ...
102 | 2 | 1 | ...
103 | 2 | 2 | ...
I need to get:
Id | ProductId | Version | ...
100 | 1 | 0 | ...
103 | 2 | 2 | ...
In SQL I would write:
SELECT Id, ProductId, Version, OtherColumns
FROM Products p1
WHERE NOT EXISTS
(SELECT 1
FROM Products p2
WHERE p2.ProductId = p1.ProductId
AND p2.Version > p1.Version)
But I have no idea how to express this in DAX. Is this approach with subqueries inapplicable in PowerBI?
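For reference, the SQL above can be checked against the sample data with a small Python sketch. It uses the standard-library sqlite3 module as a stand-in engine (an assumption, since the question does not name one) and omits the unnamed "other columns".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Products (Id INTEGER PRIMARY KEY, ProductId INTEGER, Version INTEGER)"
)
# Sample rows from the question.
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?)",
    [(100, 1, 0), (101, 2, 0), (102, 2, 1), (103, 2, 2)],
)

# Keep a row only if no other row for the same ProductId has a higher Version.
latest = conn.execute(
    """
    SELECT Id, ProductId, Version
    FROM Products p1
    WHERE NOT EXISTS (
        SELECT 1
        FROM Products p2
        WHERE p2.ProductId = p1.ProductId
          AND p2.Version > p1.Version)
    ORDER BY Id
    """
).fetchall()
print(latest)  # [(100, 1, 0), (103, 2, 2)]
```

This is the result any DAX translation should reproduce: one row per ProductId, carrying the highest Version.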
Another approach is to first construct a virtual table of product_ids and their latest versions, and then use this table to filter the original table:
EVALUATE
VAR Latest_Product_Versions =
ADDCOLUMNS(
VALUES('Product'[Product_Id]),
"Latest Version", CALCULATE(MAX('Product'[Version])))
RETURN
CALCULATETABLE(
'Product',
TREATAS(Latest_Product_Versions, 'Product'[Product_Id], 'Product'[Version]))
The benefit of this approach is an optimal query execution plan.
You can use SUMMARIZECOLUMNS to group ProductId and MAX Version.
Then use ADDCOLUMNS to add the corresponding Id number(s), using a filter on the Products table for the matching ProductId and Version. I've used CONCATENATEX here, so that if multiple Id values have the same Product / MAX Version combination, all Id values will be returned, as a list.
EVALUATE
ADDCOLUMNS (
SUMMARIZECOLUMNS (
Products[ProductId],
"#Max Version",
MAX ( Products[Version] )
),
"#Max Version Id",
CONCATENATEX (
FILTER (
Products,
Products[Version] = [#Max Version] && Products[ProductId] = EARLIER ( Products[ProductId] )
),
Products[Id],
","
)
)

Postgres - How to Join Tables without duplicates

I'm working on a project that used SQLite locally. Now that I am moving to Postgres (on Heroku), my query fails with the error "r.social must appear in the GROUP BY clause or be used in an aggregate function".
The original query is:
SELECT DISTINCT c.name, r.social, c.description, p.price
FROM cryptomodels_coin c
LEFT JOIN cryptomodels_coinprice p
ON p.coin_id = c.name
LEFT JOIN cryptomodels_CoinRating r
ON r.coin_id = c.name
GROUP BY c.name
This works fine locally, returning one unique row for each coin.
When I ran it in the Postgres environment, it threw the aggregate function error mentioned above. I managed to resolve this by adding all columns to the GROUP BY clause, as seen below:
SELECT DISTINCT c.name, r.social, c.description, p.price
FROM cryptomodels_coin c
LEFT JOIN cryptomodels_coinprice p
ON p.coin_id = c.name
LEFT JOIN cryptomodels_CoinRating r
ON r.coin_id = c.name
GROUP BY c.name, r.social, c.description, p.price
The issue is that I now have duplicate rows for each coin.
I've done a fair bit of reading and tried numerous solutions; some throw errors and others still return duplicate rows. I'm really not sure how to proceed. Thank you for any assistance.
EDIT for additional information:
Each coin has numerous prices and numerous ratings. The other tables reference the cryptomodels_coin table by using its name as "coin_id". For example, with three coins:
Coin table:
| Name |
--------
| 0X |
| XSV |
| BTC |
Price table:
| Coin_id | Price |
-------------------
| 0X | 43.2 |
| XSV | 20.0 |
| BTC | 99999|
Rating table:
| Coin_id | Social|
-------------------
| 0X | 20,000|
| XSV | 12,000|
| BTC | 5,0000|
EDIT 2:
CREATE TABLE "cryptomodels_coin" (
"name" varchar(200) NOT NULL PRIMARY KEY,
"description" text NOT NULL);
CREATE TABLE "cryptomodels_coinprice" (
"id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
"price" real NULL,
"coin_id" varchar(200) NOT NULL REFERENCES "cryptomodels_coin" ("name") );
CREATE TABLE "cryptomodels_coinrating" (
"id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
"social" text NULL, "coin_id" varchar(200) NOT NULL REFERENCES "cryptomodels_coin" ("name"));
Added SQLFiddle:
http://sqlfiddle.com/#!15/9fcff/1
Thanks!
I guess something like this would eliminate duplicates as you wish:
SELECT c.name AS name,
r.social AS social,
c.description AS description,
SUM(p.price) AS price
FROM cryptomodels_coin c
LEFT JOIN cryptomodels_coinprice p ON p.coin_id = c.name
LEFT JOIN cryptomodels_CoinRating r ON r.coin_id = c.name
GROUP BY c.name,r.social,c.description
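The duplicates come from join fan-out: every price row for a coin pairs with every rating row for that coin. A minimal sketch of one alternative follows, assuming the newest row (highest id) per coin is the one wanted, which the question does not state. It uses Python's sqlite3 as a stand-in for Postgres, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cryptomodels_coin (name TEXT PRIMARY KEY, description TEXT NOT NULL);
CREATE TABLE cryptomodels_coinprice (
    id INTEGER PRIMARY KEY AUTOINCREMENT, price REAL,
    coin_id TEXT NOT NULL REFERENCES cryptomodels_coin (name));
CREATE TABLE cryptomodels_coinrating (
    id INTEGER PRIMARY KEY AUTOINCREMENT, social TEXT,
    coin_id TEXT NOT NULL REFERENCES cryptomodels_coin (name));
-- Hypothetical sample rows: BTC has two prices and two ratings.
INSERT INTO cryptomodels_coin VALUES ('BTC', 'bitcoin'), ('XSV', 'xsv');
INSERT INTO cryptomodels_coinprice (price, coin_id)
    VALUES (90000, 'BTC'), (99999, 'BTC'), (20.0, 'XSV');
INSERT INTO cryptomodels_coinrating (social, coin_id)
    VALUES ('40,000', 'BTC'), ('50,000', 'BTC'), ('12,000', 'XSV');
""")

# Plain joins: each price pairs with each rating, so BTC appears 2 x 2 = 4 times.
fan_out = conn.execute("""
    SELECT c.name, r.social, p.price
    FROM cryptomodels_coin c
    LEFT JOIN cryptomodels_coinprice p ON p.coin_id = c.name
    LEFT JOIN cryptomodels_coinrating r ON r.coin_id = c.name
""").fetchall()

# Join only the newest row (highest id) per coin from each child table.
one_per_coin = conn.execute("""
    SELECT c.name, r.social, c.description, p.price
    FROM cryptomodels_coin c
    LEFT JOIN cryptomodels_coinprice p
           ON p.id = (SELECT MAX(id) FROM cryptomodels_coinprice WHERE coin_id = c.name)
    LEFT JOIN cryptomodels_coinrating r
           ON r.id = (SELECT MAX(id) FROM cryptomodels_coinrating WHERE coin_id = c.name)
    ORDER BY c.name
""").fetchall()

print(len(fan_out), "rows with plain joins")
print(one_per_coin)
```

Because each child table contributes at most one row per coin, no GROUP BY or DISTINCT is needed. If a different row should win (for example the latest by timestamp), only the subqueries change.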

how to find size of database, schema, table in redshift

Team,
My Redshift version is:
PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.735
How do I find the database size, tablespace, schema size and table size?
The following do not work in Redshift (for the version above):
SELECT pg_database_size('db_name');
SELECT pg_size_pretty( pg_relation_size('table_name') );
Is there an alternative, like DBA_SEGMENTS in Oracle?
For table size I have the query below, but I am not sure about the exact meaning of MBYTES. For the third row, MBYTES = 372. Does that mean 372 MB?
select trim(pgdb.datname) as Database, trim(pgn.nspname) as Schema,
trim(a.name) as Table, b.mbytes, a.rows
from ( select db_id, id, name, sum(rows) as rows from stv_tbl_perm a group by db_id, id, name ) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes
from stv_blocklist group by tbl) b on a.id=b.tbl
order by a.db_id, a.name;
database | schema | table | mbytes | rows
---------------+--------------+------------------+--------+----------
postgres | public | company | 8 | 1
postgres | public | table_data1_1 | 7 | 1
postgres | proj_schema1 | table_data1 | 372 | 33867540
postgres | public | table_data1_2 | 40 | 2000001
(4 rows)
The answers above don't always give correct results for table space used. AWS Support provided this query:
SELECT TRIM(pgdb.datname) AS Database,
TRIM(a.name) AS Table,
((b.mbytes/part.total::decimal)*100)::decimal(5,2) AS pct_of_total,
b.mbytes,
b.unsorted_mbytes
FROM stv_tbl_perm a
JOIN pg_database AS pgdb
ON pgdb.oid = a.db_id
JOIN ( SELECT tbl,
SUM( DECODE(unsorted, 1, 1, 0)) AS unsorted_mbytes,
COUNT(*) AS mbytes
FROM stv_blocklist
GROUP BY tbl ) AS b
ON a.id = b.tbl
JOIN ( SELECT SUM(capacity) AS total
FROM stv_partitions
WHERE part_begin = 0 ) AS part
ON 1 = 1
WHERE a.slice = 0
ORDER BY 4 desc, db_id, name;
Yes, mbytes in your example is 372 MB. Here's what I've been using:
select
cast(use2.usename as varchar(50)) as owner,
pgc.oid,
trim(pgdb.datname) as Database,
trim(pgn.nspname) as Schema,
trim(a.name) as Table,
b.mbytes,
a.rows
from
(select db_id, id, name, sum(rows) as rows
from stv_tbl_perm a
group by db_id, id, name
) as a
join pg_class as pgc on pgc.oid = a.id
left join pg_user use2 on (pgc.relowner = use2.usesysid)
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
and pgn.nspowner > 1
join pg_database as pgdb on pgdb.oid = a.db_id
join
(select tbl, count(*) as mbytes
from stv_blocklist
group by tbl
) b on a.id = b.tbl
order by mbytes desc, a.db_id, a.name;
I'm not sure about grouping by database and schema, but here's a short way to get usage by table:
SELECT tbl, name, size_mb FROM
(
SELECT tbl, count(*) AS size_mb
FROM stv_blocklist
GROUP BY tbl
)
LEFT JOIN
(select distinct id, name FROM stv_tbl_perm)
ON id = tbl
ORDER BY size_mb DESC
LIMIT 10;
You can check out this repository; I'm sure you'll find useful stuff there:
https://github.com/awslabs/amazon-redshift-utils
To answer your question, you can use this view:
https://github.com/awslabs/amazon-redshift-utils/blob/master/src/AdminViews/v_space_used_per_tbl.sql
and then query it as you like,
e.g.: select * from admin.v_space_used_per_tbl;
Modified versions of one of the other answers. These include the database name, schema name, table name, total row count, size on disk and unsorted size:
-- sort by row count
select trim(pgdb.datname) as Database, trim(pgns.nspname) as Schema, trim(a.name) as Table,
c.rows, ((b.mbytes/part.total::decimal)*100)::decimal(5,3) as pct_of_total, b.mbytes, b.unsorted_mbytes
from stv_tbl_perm a
join pg_class as pgtbl on pgtbl.oid = a.id
join pg_namespace as pgns on pgns.oid = pgtbl.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, sum(decode(unsorted, 1, 1, 0)) as unsorted_mbytes, count(*) as mbytes from stv_blocklist group by tbl) b on a.id=b.tbl
join (select id, sum(rows) as rows from stv_tbl_perm group by id) c on a.id=c.id
join (select sum(capacity) as total from stv_partitions where part_begin=0) as part on 1=1
where a.slice=0
order by 4 desc, db_id, name;
-- sort by space used
select trim(pgdb.datname) as Database, trim(pgns.nspname) as Schema, trim(a.name) as Table,
c.rows, ((b.mbytes/part.total::decimal)*100)::decimal(5,3) as pct_of_total, b.mbytes, b.unsorted_mbytes
from stv_tbl_perm a
join pg_class as pgtbl on pgtbl.oid = a.id
join pg_namespace as pgns on pgns.oid = pgtbl.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, sum(decode(unsorted, 1, 1, 0)) as unsorted_mbytes, count(*) as mbytes from stv_blocklist group by tbl) b on a.id=b.tbl
join (select id, sum(rows) as rows from stv_tbl_perm group by id) c on a.id=c.id
join (select sum(capacity) as total from stv_partitions where part_begin=0) as part on 1=1
where a.slice=0
order by 6 desc, db_id, name;
This query is much easier:
-- List the Top 30 largest tables on your cluster
SELECT
"schema"
,"table" AS table_name
,ROUND((size/1024.0),2) AS "Size in Gigabytes"
,pct_used AS "Physical Disk Used by This Table"
FROM svv_table_info
ORDER BY pct_used DESC
LIMIT 30;
SVV_TABLE_INFO is a Redshift system view that shows information about user-defined tables (not system tables) in a Redshift database. The view is only visible to superusers.
To get the size of each table, run the following command on your Redshift cluster:
SELECT "table", size, tbl_rows
FROM SVV_TABLE_INFO
The table column is the table name.
The size column is the size of the table in MB.
The tbl_rows column is the total number of rows in the table, including rows that have been marked for deletion but not yet vacuumed.
See the Redshift documentation for SVV_TABLE_INFO for other useful columns to retrieve from this system view.
This is what I am using (change the database name from 'mydb' to your own):
SELECT CAST(use2.usename AS VARCHAR(50)) AS OWNER
,TRIM(pgdb.datname) AS DATABASE
,TRIM(pgn.nspname) AS SCHEMA
,TRIM(a.NAME) AS TABLE
,b.mbytes / 1024.0 AS Gigabytes -- 1024.0 avoids integer division, which would show tables under 1 GB as 0
,a.ROWS
FROM (
SELECT db_id
,id
,NAME
,SUM(ROWS) AS ROWS
FROM stv_tbl_perm a
GROUP BY db_id
,id
,NAME
) AS a
JOIN pg_class AS pgc ON pgc.oid = a.id
LEFT JOIN pg_user use2 ON (pgc.relowner = use2.usesysid)
JOIN pg_namespace AS pgn ON pgn.oid = pgc.relnamespace
AND pgn.nspowner > 1
JOIN pg_database AS pgdb ON pgdb.oid = a.db_id
JOIN (
SELECT tbl
,COUNT(*) AS mbytes
FROM stv_blocklist
GROUP BY tbl
) b ON a.id = b.tbl
WHERE pgdb.datname = 'mydb'
ORDER BY mbytes DESC
,a.db_id
,a.NAME;
src: https://aboutdatabases.wordpress.com/2015/01/24/amazon-redshift-how-to-get-the-sizes-of-all-tables/

Compare Tables in BigQuery

How would I compare two tables (Table1 and Table2) and find all the new entries or changes in Table2?
Using SQL Server I can use:
Select * from Table2
Except
Select * from Table1
Here is a sample of what I want:
Table1
A | 1
B | 2
C | 3
Table2
A | 1
B | 2
C | 2
D | 4
So, if I compare the two tables, I want my results to show the following:
C | 2
D | 4
I tried a few statements with no luck.
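A minimal sketch of the EXCEPT comparison on the sample data, using Python's built-in sqlite3 as a stand-in engine (an assumption: SQLite's EXCEPT behaves like SQL Server's for this case). The column names k and v are hypothetical. Direction matters here: rows that are in Table2 but not in Table1 come from Table2 EXCEPT Table1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (k TEXT, v INTEGER);
CREATE TABLE Table2 (k TEXT, v INTEGER);
INSERT INTO Table1 VALUES ('A', 1), ('B', 2), ('C', 3);
INSERT INTO Table2 VALUES ('A', 1), ('B', 2), ('C', 2), ('D', 4);
""")

# New or changed rows in Table2: whole rows that have no exact match in Table1.
new_or_changed = conn.execute(
    "SELECT * FROM Table2 EXCEPT SELECT * FROM Table1 ORDER BY 1"
).fetchall()

# The opposite direction gives rows that exist only in Table1.
only_in_table1 = conn.execute(
    "SELECT * FROM Table1 EXCEPT SELECT * FROM Table2 ORDER BY 1"
).fetchall()

print(new_or_changed)  # [('C', 2), ('D', 4)]
print(only_in_table1)  # [('C', 3)]
```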
Now that I have your actual sample dataset, I can write a query that finds every domain in one table that is not on the other table:
https://bigquery.cloud.google.com/table/inbound-acolyte-377:demo.1024 has 24,729,816 rows. https://bigquery.cloud.google.com/table/inbound-acolyte-377:demo.1025 has 24,732,640 rows.
Let's look at everything in 1025 that is not in 1024:
SELECT a.domain
FROM [inbound-acolyte-377:demo.1025] a
LEFT OUTER JOIN EACH [inbound-acolyte-377:demo.1024] b
ON a.domain = b.domain
WHERE b.domain IS NULL
Result: 39,629 rows.
(8.1s elapsed, 2.04 GB processed)
To get the differences (given that tkey is your unique row identifier):
SELECT a.tkey, a.name, b.name
FROM [your.tableold] a
JOIN EACH [your.tablenew] b
ON a.tkey = b.tkey
WHERE a.name != b.name
LIMIT 100
For the new rows, one way is the one you proposed:
SELECT col1, col2
FROM table2
WHERE col1 NOT IN
(SELECT col1 FROM Table1)
(you'll have to switch to a JOIN EACH when Table1 gets too large)
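Both patterns above (an anti-join for new rows, a join on the key for changed rows) can be sketched on small data. This uses Python's sqlite3 with standard SQL rather than BigQuery's legacy syntax, so there is no JOIN EACH; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_old (tkey TEXT PRIMARY KEY, val INTEGER);
CREATE TABLE table_new (tkey TEXT PRIMARY KEY, val INTEGER);
INSERT INTO table_old VALUES ('A', 1), ('B', 2), ('C', 3);
INSERT INTO table_new VALUES ('A', 1), ('B', 2), ('C', 2), ('D', 4);
""")

# New rows: keys in table_new with no match in table_old (anti-join).
new_rows = conn.execute("""
    SELECT n.tkey, n.val
    FROM table_new n
    LEFT JOIN table_old o ON o.tkey = n.tkey
    WHERE o.tkey IS NULL
""").fetchall()

# Changed rows: same key in both tables, different value.
changed_rows = conn.execute("""
    SELECT o.tkey, o.val AS old_val, n.val AS new_val
    FROM table_old o
    JOIN table_new n ON n.tkey = o.tkey
    WHERE o.val != n.val
""").fetchall()

print(new_rows)      # [('D', 4)]
print(changed_rows)  # [('C', 3, 2)]
```

The anti-join form is the one the NOT IN query would be rewritten to once the lookup table grows too large for a subquery.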