Query exhausted resources at this scale factor - amazon-athena

I am trying to left join a very big table (52 million rows) to a massive table with 11,553,668,111 rows but just two columns.
A simple left join errors out with "Query exhausted resources at this scale factor."
-- create smaller table to save $$
CREATE TABLE targetsmart_idl_data_mi_pa_maid AS
SELECT a.idl,
       a.grouping_indicator,
       a.vb_voterbase_dob,
       a.vb_voterbase_gender,
       a.ts_tsmart_urbanicity,
       a.ts_tsmart_high_school_only_score,
       a.ts_tsmart_college_graduate_score,
       a.ts_tsmart_partisan_score,
       a.ts_tsmart_presidential_general_turnout_score,
       a.vb_voterbase_marital_status,
       a.vb_tsmart_census_id,
       a.vb_voterbase_deceased_flag,
       b.maid
FROM targetsmart_idl_data_pa_mi_pa a
LEFT JOIN idl_maid_base b
  ON a.idl = b.idl

I was able to overcome the issue by making the large table the driving (left) table. Athena's engine distributes the right-hand table of a join to the worker nodes and streams the left-hand table past it, so the smaller table should go on the right.
For example:
SELECT col1, col2 FROM table_a a JOIN table_b b ON a.col1 = b.col1
Here table_a is small, with fewer than 1,000 records, whereas table_b has millions of records. The above query errors out.
Rewrite the query as:
SELECT col1, col2 FROM table_b b JOIN table_a a ON a.col1 = b.col1
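Applied to the original query above, that would mean driving from the 11.5-billion-row idl_maid_base table. A minimal sketch using the same tables and columns as the question; treat it as something to test rather than a guaranteed fix, since join reordering depends on the engine version, and note the switch to RIGHT JOIN to preserve the original left-join semantics:
-- Sketch: 11.5B-row table first (streamed side), 52M-row table on the right.
-- RIGHT JOIN keeps every row of targetsmart_idl_data_pa_mi_pa,
-- matching the original LEFT JOIN.
CREATE TABLE targetsmart_idl_data_mi_pa_maid AS
SELECT a.idl, a.grouping_indicator, a.vb_voterbase_dob, a.vb_voterbase_gender,
       a.ts_tsmart_urbanicity, a.ts_tsmart_high_school_only_score,
       a.ts_tsmart_college_graduate_score, a.ts_tsmart_partisan_score,
       a.ts_tsmart_presidential_general_turnout_score,
       a.vb_voterbase_marital_status, a.vb_tsmart_census_id,
       a.vb_voterbase_deceased_flag, b.maid
FROM idl_maid_base b
RIGHT JOIN targetsmart_idl_data_pa_mi_pa a
  ON a.idl = b.idl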

Related

How to add a column with query folding using the Snowflake connector

I am trying to add a new column to a Power Query result that is the result of subtracting one column from another. According to the Power BI documentation, basic arithmetic is supported with query folding, but for some reason it shows a failure to query fold. I also tried simply adding a column populated with the number 1, and it still did not work. Is there some trick to getting query folding to work for a new column on Snowflake?
If the computation is based only on data from the source, it can be computed during table import as a SQL statement:
SELECT col1, col2, col1 + col2 AS computed_total
FROM my_table_name
EDIT:
The problem with this solution is that a native SQL statement for Snowflake is only supported in Power BI Desktop, and I want to have this stored in a dataflow (i.e., the Power BI web client) for reusability and other reasons.
Option 1:
Create a view instead of a table at the source:
CREATE OR REPLACE VIEW my_view
AS
SELECT col1, col2, col1 + col2 AS computed_total
FROM my_table_name;
Option 2:
Add a computed column to the table:
ALTER TABLE my_table_name
ADD COLUMN computed_total NUMBER(38,4) AS (col1 + col2);
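To confirm the column landed as expected, you can inspect the table's DDL (GET_DDL is a standard Snowflake function; my_table_name is the example table from above):
-- Show the full DDL for the table, including the computed column.
SELECT GET_DDL('TABLE', 'my_table_name');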

Joining 2 tables results in query timeout

I have a few tables created in AWS Athena under "TestDB". These tables are created by running an AWS Glue crawler through the S3 buckets. I am trying to create a new table by joining 2 existing tables under "TestDB". It is a simple left outer join as follows:
CREATE TABLE IF NOT EXISTS TestTab1 AS (
  SELECT *
  FROM (
    SELECT col1, col2, col3, col4
    FROM "TestDB"."tab1"
    WHERE partition_0 = '10-24-2021'
      AND substring(datetimestamp, 1, 10) = '2021-10-24'
  ) a
  LEFT OUTER JOIN (
    SELECT col1, col2, col3, col4
    FROM "TestDB"."tab2"
    WHERE partition_0 = '10-24-2021'
      AND substring(datetimestamp, 1, 10) = '2021-10-24'
  ) b
  ON (a.col1 = b.col1)
)
The query scans around 5 GB of data but times out after ~30 minutes, since that is the timeout limit. Other than requesting an increase in the timeout limit, is there any other way to create a join of 2 tables on AWS?
It's very hard to say from the information you provide, but it's probably down to the result becoming very big, or an intermediate result becoming big enough that the executors run out of memory and have to spill to disk.
Does running just the SELECT, without the CREATE TABLE, work? You can also try to run EXPLAIN SELECT … to get the query plan and see if that tells you anything.
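For instance, a sketch using the table names from the question (EXPLAIN is available in recent Athena engine versions and returns the plan without executing the query):
-- Show the logical plan.
EXPLAIN
SELECT a.col1, a.col2, a.col3, a.col4
FROM "TestDB"."tab1" a
LEFT JOIN "TestDB"."tab2" b ON a.col1 = b.col1
WHERE a.partition_0 = '10-24-2021';

-- The distributed variant also shows the exchanges between stages,
-- which is where a blown-up intermediate result would show itself.
EXPLAIN (TYPE DISTRIBUTED)
SELECT a.col1, a.col2, a.col3, a.col4
FROM "TestDB"."tab1" a
LEFT JOIN "TestDB"."tab2" b ON a.col1 = b.col1
WHERE a.partition_0 = '10-24-2021';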
Your query is unnecessarily complex, with multiple nested SELECT statements. Athena's query planner will most likely rewrite it to something like the following, which is easier to read and understand. Note that the filters on b belong in the ON clause: putting them in the WHERE clause would silently turn the left join into an inner join.
CREATE TABLE IF NOT EXISTS TestTab1 AS
SELECT a.col1, a.col2, a.col3, a.col4
FROM "TestDB"."tab1" a
LEFT JOIN "TestDB"."tab2" b
  ON a.col1 = b.col1
  AND b.partition_0 = '10-24-2021'
  AND substring(b.datetimestamp, 1, 10) = '2021-10-24'
WHERE a.partition_0 = '10-24-2021'
  AND substring(a.datetimestamp, 1, 10) = '2021-10-24'

Informatica - SQ transformation

What will be the expected result of the below?
I have table A with Column1.
I'm trying to map Column1 to the SQ (Source Qualifier), which has 3 columns: col1, col2, and col3.
I link Column1 to col1, col2, and col3 in the SQ. Now when I try to generate the SQL query for the SQ, what will the result be?
Since the OP is waiting for an answer and doesn't have Informatica to test it out, let me answer.
If you connect one column to three columns in the SQ, and then connect all three of those columns to the next transformation, the generated SQL will contain the same source column repeated three times.
Here are some screenshots from a dummy mapping I created.
Mapping screenshot:
And here is the generated SQL:
SELECT
ITEM.ITEM_NUM, ITEM.ITEM_NUM, ITEM.ITEM_NUM
FROM
ITEM

How do I select rows from an SQLite table excluding ones from a previous query?

I have an SQLite table with more than 25 million rows. I selected 1 million rows at random from this table using the following code:
# using sqlite3
c = cursor.execute("""
    SELECT *
    FROM reviews_table
    WHERE ROWID IN (
        SELECT ROWID FROM reviews_table ORDER BY RANDOM() LIMIT 1000000
    )
""")
Now, I wish to select another 1 million rows from the table, excluding those rows in the previous query. How would I go about doing this?
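One possible approach, sketched below: persist the ROWIDs of each draw in a bookkeeping table, then exclude them from the next draw. The table sampled_rowids and its draw column are hypothetical names, and this assumes you can write to the database:
-- Hypothetical bookkeeping table recording which ROWIDs were sampled, and when.
CREATE TABLE IF NOT EXISTS sampled_rowids (rid INTEGER PRIMARY KEY, draw INTEGER);

-- First draw: pick 1 million random rows and remember them.
INSERT INTO sampled_rowids
SELECT ROWID, 1 FROM reviews_table
ORDER BY RANDOM() LIMIT 1000000;

-- Second draw: sample only rows not seen before, and remember these too.
INSERT INTO sampled_rowids
SELECT ROWID, 2 FROM reviews_table
WHERE ROWID NOT IN (SELECT rid FROM sampled_rowids)
ORDER BY RANDOM() LIMIT 1000000;

-- Fetch the rows of the second draw.
SELECT * FROM reviews_table
WHERE ROWID IN (SELECT rid FROM sampled_rowids WHERE draw = 2);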

Update the values of a column in a dataset with another table

If I have Table A with two columns, ID and Mean, and Table B with a long list of columns including Mean, how can I replace the values of the Mean column in Table B for the IDs that exist in Table A?
I've tried PROC SQL UPDATE, as well as the DATA step MERGE and UPDATE statements, but they keep adding rows when the number of rows is not equal in the two tables.
data want;
  merge have1(in=H1) have2(in=H2);
  by mergevar;
  if H1;
run;
That will guarantee that have2 does not add any rows, unless there are duplicate values of the BY variable. Other conditions can be used as well: if h2; would do about the same thing for the right-hand dataset, and if h1 and h2; would keep only records that appear in both tables.
A PROC SQL join should also work fairly easily.
proc sql;
  create table want as
  select A.id, coalesce(B.mean, A.mean) as mean
  from A left join B
    on A.id = B.id;
quit;