Joining 2 tables results in query timeout - amazon-web-services

I have a few tables created in AWS Athena under "TestDB". These tables are created by running an AWS Glue crawler through the S3 buckets. I am trying to create a new table by joining 2 existing tables under "TestDB". It is a simple left outer join as follows:
CREATE TABLE IF NOT EXISTS TestTab1
AS (
    SELECT *
    FROM (
        (
            SELECT col1, col2, col3, col4
            FROM "TestDB"."tab1" a
            WHERE a.partition_0 = '10-24-2021'
                AND substring(a.datetimestamp, 1, 10) = '2021-10-24'
        )
        LEFT OUTER JOIN (
            SELECT col1, col2, col3, col4
            FROM "TestDB"."tab2" b
            WHERE b.partition_0 = '10-24-2021'
                AND substring(b.datetimestamp, 1, 10) = '2021-10-24'
        )
        ON (a.col1 = b.col1)
    )
)
The query scans around 5 GB of data but times out after ~30 minutes, since that is the query timeout limit. Other than requesting an increase in the timeout limit, is there any other way to create a join of 2 tables on AWS?

It's very hard to say from the information you provide, but it's probably down to the result becoming very big or an intermediate result becoming big enough for the executors to run out of memory and having to spill to disk.
Does running just the SELECT (without the CREATE TABLE) work? You can also try running EXPLAIN SELECT … to get the query plan and see if that tells you anything.
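For example, a sketch of that check, assuming an Athena engine version that supports EXPLAIN (the SELECT below is just a stand-in for the SELECT part of your CTAS):
EXPLAIN
SELECT a.col1, a.col2, a.col3, a.col4
FROM "TestDB"."tab1" a
LEFT JOIN "TestDB"."tab2" b ON a.col1 = b.col1
WHERE a.partition_0 = '10-24-2021'
  AND b.partition_0 = '10-24-2021'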
Your query is unnecessarily complex with multiple nested SELECT statements. I think Athena's query planner will be smart enough to rewrite it to something like the following, which is easier to read and understand:
CREATE TABLE IF NOT EXISTS TestTab1 AS
SELECT col1, a.col2, a.col3, a.col4
FROM "TestDB"."tab1" a LEFT JOIN "TestDB"."tab2" b USING (col1)
WHERE a.partition_0 = '10-24-2021'
AND b.partition_0 = '10-24-2021'
AND substring(a.datetimestamp, 1, 10) = '2021-10-24'
AND substring(b.datetimestamp, 1, 10) = '2021-10-24'
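One thing to double-check with this form: because the b.partition_0 and b.datetimestamp filters sit in the WHERE clause, the LEFT JOIN effectively behaves like an inner join. If the original outer-join behaviour matters, the b-side filters can move into the join condition instead; a sketch, assuming the same tables and columns:
CREATE TABLE IF NOT EXISTS TestTab1 AS
SELECT a.col1, a.col2, a.col3, a.col4
FROM "TestDB"."tab1" a
LEFT JOIN "TestDB"."tab2" b
    ON a.col1 = b.col1
    AND b.partition_0 = '10-24-2021'
    AND substring(b.datetimestamp, 1, 10) = '2021-10-24'
WHERE a.partition_0 = '10-24-2021'
  AND substring(a.datetimestamp, 1, 10) = '2021-10-24'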

Related

How to add a column with query folding using the Snowflake connector

I am trying to add a new column to a Power Query result that is the result of subtracting one column from another. According to the Power BI documentation, basic arithmetic is supported with query folding, but for some reason it is showing a failure to query fold. I also tried simply adding a column populated with the number 1 and it still was not working. Is there some trick to getting query folding of a new column to work on Snowflake?
If the computation is based only on data from the source, it could be computed during table import as a SQL statement:
SELECT col1, col2, col1 + col2 AS computed_total
FROM my_table_name
EDIT:
The problem with this solution is that native SQL statements for Snowflake are only supported in Power BI Desktop, and I want to have this stored in a dataflow (i.e., the Power BI web client) for reusability and other reasons.
Option 1:
Create a view instead of a table at the source:
CREATE OR REPLACE VIEW my_view
AS
SELECT col1, col2, col1 + col2 AS computed_total
FROM my_table_name;
Option 2:
Add a computed column to the table:
ALTER TABLE my_table_name
ADD COLUMN computed_total NUMBER(38,4) AS (col1 + col2);
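With either option in place, the M query generated by the dataflow should fold down to an ordinary select over the precomputed column rather than needing a native query; a minimal sketch of the folded SQL, using the view from Option 1:
SELECT col1, col2, computed_total
FROM my_view;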

Retrieving the row with the greatest timestamp in QuestDB

I'm currently running QuestDB 6.1.2 on Linux. How do I get the row with the maximum timestamp from a table? I have tried the following on a test table with around 5 million rows:
1) select * from table where cast(timestamp as symbol) in (select cast(max(timestamp) as symbol) from table);
2) select * from table inner join (select max(timestamp) mm from table) on timestamp >= mm
3) select * from table where timestamp = max(timestamp)
4) select * from table where timestamp = (select max(timestamp) from table)
Here, 1 is correct but runs in ~5s; 2 is correct and runs in ~500ms but looks unnecessarily verbose for such a query; 3 compiles but returns an empty table; and 4 is incorrect syntax, although that's how SQL usually does it.
select * from table limit -1 works. QuestDB returns rows sorted by timestamp by default, and limit -1 takes the last row, which happens to be the row with the greatest timestamp. To be explicit about ordering by timestamp, select * from table order by timestamp limit -1 could be used instead. This query runs in around 300-400ms on the same table.
As a side note, the third query, using timestamp = max(timestamp), doesn't work because QuestDB does not support subqueries in WHERE yet (as of QuestDB 6.1.2).

Query exhausted resources on this scale factor

I am trying to left join a very big table (52 million rows) to a massive table with 11,553,668,111 observations but just two columns.
Simple left join commands error out with "Query exhausted resources at this scale factor."
-- create smaller table to save $$
CREATE TABLE targetsmart_idl_data_mi_pa_maid AS
SELECT targetsmart_idl_data_pa_mi_pa.idl, targetsmart_idl_data_pa_mi_pa.grouping_indicator, targetsmart_idl_data_pa_mi_pa.vb_voterbase_dob, targetsmart_idl_data_pa_mi_pa.vb_voterbase_gender, targetsmart_idl_data_pa_mi_pa.ts_tsmart_urbanicity, targetsmart_idl_data_pa_mi_pa.ts_tsmart_high_school_only_score,
targetsmart_idl_data_pa_mi_pa.ts_tsmart_college_graduate_score, targetsmart_idl_data_pa_mi_pa.ts_tsmart_partisan_score, targetsmart_idl_data_pa_mi_pa.ts_tsmart_presidential_general_turnout_score, targetsmart_idl_data_pa_mi_pa.vb_voterbase_marital_status, targetsmart_idl_data_pa_mi_pa.vb_tsmart_census_id,
targetsmart_idl_data_pa_mi_pa.vb_voterbase_deceased_flag, idl_maid_base.maid
FROM targetsmart_idl_data_pa_mi_pa
LEFT JOIN idl_maid_base
ON targetsmart_idl_data_pa_mi_pa.idl = idl_maid_base.idl
I was able to overcome the issue by having the large table as the driving table.
For example:
select col1, col2 from table_a a join table_b b on a.col1 = b.col1
Here table_a is small, with fewer than 1,000 records, whereas table_b has millions of records. The above query errors out.
Rewrite the query as:
select col1, col2 from table_b b join table_a a on a.col1 = b.col1
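Applied to the CTAS in the question, that suggestion means driving from the 11.5-billion-row idl_maid_base table. A sketch with the column list abbreviated; the RIGHT JOIN keeps the original behaviour of preserving every targetsmart row, though whether this actually changes which side of the join is held in memory is worth confirming with EXPLAIN:
CREATE TABLE targetsmart_idl_data_mi_pa_maid AS
SELECT t.idl, t.grouping_indicator, /* ...remaining targetsmart columns... */ m.maid
FROM idl_maid_base m
RIGHT JOIN targetsmart_idl_data_pa_mi_pa t
    ON t.idl = m.idl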

AWS Glue joining

I am new to AWS Glue and trying to join two Redshift SQL queries, but I'm not sure how to keep only selected fields, since my main table has more than 1000 fields.
Below is the query I am trying to reproduce in Glue.
SELECT v.col1,
       v.col2,
       s.col3
FROM (
    SELECT col1,
           col2
    FROM t1
    WHERE col1 > 0
) v
LEFT JOIN (
    SELECT col1,
           col3
    FROM t2
    WHERE col1 > 0
    GROUP BY col1
) s
    ON v.col1 = s.col1
If you are writing in Python, I would either use Spark SQL or use PySpark's join functions.
For Spark SQL:
1) Convert each Glue DynamicFrame to an Apache Spark DataFrame using the toDF() function.
2) Register each DataFrame as a Spark SQL table/view using createOrReplaceTempView().
3) Then run spark.sql() with the query you posted above.
OR
Use PySpark
left_join = t1.join(t2, t1.col1 == t2.col1, how='left')
left_join = left_join.filter(t1.col1 > 0)  # then filter afterwards; t1.col1 avoids an ambiguous 'col1' reference after the join
Would that work for you?

How to query historical table size of database in Redshift to determine database size growth

I want to project forward the size of my Amazon Redshift tables because I'm planning to expand my Redshift cluster size.
I know how to query the table sizes for today (see the query below), but how can I measure the growth of my table sizes over time without building an ETL job that snapshots table sizes day by day?
-- Capture table sizes
select
    trim(pgdb.datname) as Database,
    trim(pgn.nspname) as Schema,
    trim(a.name) as Table,
    b.mbytes,
    a.rows
from (
    select db_id, id, name, sum(rows) as rows
    from stv_tbl_perm a
    group by db_id, id, name
) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (
    select tbl, count(*) as mbytes
    from stv_blocklist
    group by tbl
) b on a.id = b.tbl
order by mbytes desc, a.db_id, a.name;
There is no historical table size information retained by Amazon Redshift. You would need to run a query on a regular basis, such as the one in your question.
You could wrap the query in an INSERT statement and run it on a weekly basis, inserting the results into a table. This way, you'll have historical table size information for each table each week that you can use to predict future growth.
It would be worth doing a VACUUM prior to such measurements, to remove deleted rows from storage.
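A minimal sketch of that approach, assuming a (hypothetical) history table named table_size_history and reusing the query from the question:
create table if not exists table_size_history (
    snapshot_date   date,
    database_name   varchar(128),
    schema_name     varchar(128),
    table_name      varchar(128),
    mbytes          bigint,
    row_count       bigint
);

-- run on a schedule (e.g. weekly) to append one snapshot row per table
insert into table_size_history
select
    current_date,
    trim(pgdb.datname),
    trim(pgn.nspname),
    trim(a.name),
    b.mbytes,
    a.rows
from (
    select db_id, id, name, sum(rows) as rows
    from stv_tbl_perm
    group by db_id, id, name
) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
join pg_database as pgdb on pgdb.oid = a.db_id
join (
    select tbl, count(*) as mbytes
    from stv_blocklist
    group by tbl
) b on a.id = b.tbl;
Trend queries then become a simple aggregation over snapshot_date in this table.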
The following metrics are available in CloudWatch:
RedshiftManagedStorageTotalCapacity (m1)
PercentageDiskSpaceUsed (m2)
Create a CloudWatch math expression m1*m2/100 to get this data for the past 3 months.