How do I do a left anti join in AWS Athena? I have googled it and didn't find any help. Any alternative solution would also be appreciated.
I have two tables, emp and dept, and I want to do a left anti join between them using the columns "emp_dept_id" and "dept_id".
I need a query for Athena.
Here is a left anti-join query per your request:
SELECT e.*
FROM Emp e
LEFT JOIN Dept d
ON d.dept_id = e.emp_dept_id
WHERE d.dept_id IS NULL;
Note that you could also express the above using exists logic:
SELECT e.*
FROM Emp e
WHERE NOT EXISTS (
SELECT 1
FROM Dept d
WHERE d.dept_id = e.emp_dept_id
);
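To check that the two forms above return the same rows, here is a minimal sketch that runs both against an in-memory SQLite database from Python. SQLite is only a stand-in for Athena, and the sample rows and the extra columns (emp_id, name, dept_name) are made up for illustration; the two SELECT statements themselves use standard SQL that Athena should also accept.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Athena (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (emp_id INTEGER, name TEXT, emp_dept_id INTEGER);
    CREATE TABLE dept (dept_id INTEGER, dept_name TEXT);
    -- Hypothetical sample data: dept 99 does not exist, and one emp has no dept.
    INSERT INTO emp VALUES (1, 'Ann', 10), (2, 'Bob', 20), (3, 'Cy', 99), (4, 'Di', NULL);
    INSERT INTO dept VALUES (10, 'Sales'), (20, 'Ops');
""")

# Left anti join written as LEFT JOIN ... WHERE right side IS NULL.
left_join_rows = conn.execute("""
    SELECT e.emp_id
    FROM emp e
    LEFT JOIN dept d ON d.dept_id = e.emp_dept_id
    WHERE d.dept_id IS NULL
    ORDER BY e.emp_id
""").fetchall()

# Same result written with NOT EXISTS.
not_exists_rows = conn.execute("""
    SELECT e.emp_id
    FROM emp e
    WHERE NOT EXISTS (
        SELECT 1 FROM dept d WHERE d.dept_id = e.emp_dept_id
    )
    ORDER BY e.emp_id
""").fetchall()

print(left_join_rows)
print(not_exists_rows)
```

Both queries keep the employee whose emp_dept_id has no match in dept and the employee whose emp_dept_id is NULL.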
I have recently implemented distribution styles and sort keys on a few Redshift tables after doing the following analysis:
To find the best candidates for the distribution key and sort key, I looked at the queries that run on these tables and shortlisted one column for the sort key and distribution key (a key rather than a style) on the basis of the column's cardinality (the degree of uniqueness of the column's data). I also ran EXPLAIN on the queries, which clearly showed a reduction in cost, width, and rows scanned.
On the basis of the above, I applied a sort key and dist key to a few of the tables, which produced the best performance.
After the implementation, when I looked at pct_unsorted for these tables, these were my observations:
Not all tables were auto vacuumed by Redshift. Some had [BG] Vacuum run on them, whereas others only had [BG] Vacuum Delete run on them. These were the tables with 100% unsorted data -
Upon looking into the vacuum status from stl_vacuum -
My questions:
On what grounds does Redshift decide which kind of vacuum to run (auto vacuum)?
Why is Redshift running VACUUM for some tables and only VACUUM DELETE ONLY for others?
What is the ideal or best practice for vacuuming tables to get the data sorted?
Thanks in Advance.
The queries run for the results shown above are:
select trim(pgn.nspname) as schema,
trim(a.name) as table, id as tableid,
decode(pgc.reldiststyle,0, 'even',1,det.distkey ,8,'all') as distkey, dist_ratio.ratio::decimal(10,4) as skew,
det.head_sort as "sortkey",
det.n_sortkeys as "#sks", b.mbytes,
decode(b.mbytes,0,0,((b.mbytes/part.total::decimal)*100)::decimal(5,2)) as pct_of_total,
decode(det.max_enc,0,'n','y') as enc, a.rows,
decode( det.n_sortkeys, 0, null, a.unsorted_rows ) as unsorted_rows ,
decode( det.n_sortkeys, 0, null, decode( a.rows,0,0, (a.unsorted_rows::decimal(32)/a.rows)*100) )::decimal(5,2) as pct_unsorted
from (select db_id, id, name, sum(rows) as rows,
sum(rows)-sum(sorted_rows) as unsorted_rows
from stv_tbl_perm a
group by db_id, id, name) as a
join pg_class as pgc on pgc.oid = a.id
join pg_namespace as pgn on pgn.oid = pgc.relnamespace
left outer join (select tbl, count(*) as mbytes
from stv_blocklist group by tbl) b on a.id=b.tbl
inner join (select attrelid,
min(case attisdistkey when 't' then attname else null end) as "distkey",
min(case attsortkeyord when 1 then attname else null end ) as head_sort ,
max(attsortkeyord) as n_sortkeys,
max(attencodingtype) as max_enc
from pg_attribute group by 1) as det
on det.attrelid = a.id
inner join ( select tbl, max(mbytes)::decimal(32)/min(mbytes) as ratio
from (select tbl, trim(name) as name, slice, count(*) as mbytes
from svv_diskusage group by tbl, name, slice )
group by tbl, name ) as dist_ratio on a.id = dist_ratio.tbl
join ( select sum(capacity) as total
from stv_partitions where part_begin=0 ) as part on 1=1
where mbytes is not null
and "table" in (<TABLE NAMES>)
order by mbytes desc
and
select xid, a.table_id,b."table", status, rows, sortedrows, blocks, eventtime
from stl_vacuum a
inner join (select table_id,"table" from svv_table_info where "schema" in ('touchpoint','activehealth')
and "table" in (<TABLE NAMES>)
) b on a.table_id = b. table_id
UPDATE:
One interesting observation: in svv_table_info,
the vacuum_sort_benefit is 0 for these tables with 100% unsorted data.
I want to concatenate two SAS data sets, one from 2003 and one from 2013. There is a unique identifier in both, and I'll only allow records to be concatenated if they appear in both.
NB: there are multiple records with the same ID.
Here's some untested code:
proc sql;
create table want as
select * from(
select * from t1 where t1.id in (select t2.id from t2)
union
select * from t2 where t2.id in (select t1.id from t1)) as A;
quit;
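As a quick check of the logic, here is a sketch of the same idea run against an in-memory SQLite database from Python rather than in SAS. The column val and the sample rows are made up for illustration. It uses UNION ALL instead of UNION, which is an assumption about what is wanted: UNION would drop rows that are identical in every column, while UNION ALL keeps every record.

```python
import sqlite3

# SQLite stands in for PROC SQL here (assumption); t1 = 2003 data, t2 = 2013 data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (id INTEGER, val TEXT);
    CREATE TABLE t2 (id INTEGER, val TEXT);
    -- Hypothetical rows: id 1 is in both tables, id 2 only in t1, id 3 only in t2.
    INSERT INTO t1 VALUES (1, 'a03'), (1, 'b03'), (2, 'c03');
    INSERT INTO t2 VALUES (1, 'a13'), (3, 'd13'), (1, 'e13');
""")

# Stack the rows of both tables, keeping only ids that occur in both.
want = conn.execute("""
    SELECT id, val FROM t1 WHERE id IN (SELECT id FROM t2)
    UNION ALL
    SELECT id, val FROM t2 WHERE id IN (SELECT id FROM t1)
    ORDER BY val
""").fetchall()

print(want)
```

Only the records with id 1 survive, and all four of them are kept even though the id repeats.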
I have two tables like below.
Profile : ID
Charac : ID, NAME, DATE
With the above tables, I am trying to get NAME from Charac where we have max date.
I am trying to do a join with proc sql by replicating the answer for mysql like below
proc sql;
create table ggg as
select profile.ID ,T2.NAME
from Profile
left join
( select ID,max(DATE) as max_DATE
from EDW.CHARAC
group by ID
) as T1
on fff.ID = EDW.ID
left join EDW.CHARAC as T2
on T2.ID = T1.max_DATE
order by profile.ID DESC;
quit;
Error
ERROR: Unresolved reference to table/correlation name EDW.
ERROR: Expression using equals (=) has components that are of different data types.
Could it be that you intended
on T2.ID = T1.max_DATE
(which is probably the source of the "components that are of different data types" error)
to be:
on T2.ID = T1.ID and T2.DATE = T1.max_DATE
that is, joining on IDs at the maximum DATE?
You can't use EDW like that. You need to join
on fff.ID=T1.ID
As for the data types error, that is probably because EDW.ID is undefined and thus numeric by default.
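Putting both corrections together, the query shape can be sketched as follows. This runs against an in-memory SQLite database from Python rather than in SAS, so it only illustrates the join logic. The table aliases p, t1 and t2, the plain table name charac (without the EDW libref), and the sample rows are assumptions for illustration.

```python
import sqlite3

# SQLite stands in for PROC SQL (assumption); dates are ISO strings so MAX() orders them correctly.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profile (id INTEGER);
    CREATE TABLE charac (id INTEGER, name TEXT, date TEXT);
    -- Hypothetical rows: id 1 has two dated names, id 2 has one, id 3 has none.
    INSERT INTO profile VALUES (1), (2), (3);
    INSERT INTO charac VALUES
        (1, 'old',  '2020-01-01'),
        (1, 'new',  '2021-06-01'),
        (2, 'only', '2019-03-03');
""")

rows = conn.execute("""
    SELECT p.id, t2.name
    FROM profile p
    LEFT JOIN (
        -- Latest date per id.
        SELECT id, MAX(date) AS max_date
        FROM charac
        GROUP BY id
    ) AS t1
        ON p.id = t1.id
    LEFT JOIN charac AS t2
        -- Join on the id AND the latest date, not id = date.
        ON t2.id = t1.id AND t2.date = t1.max_date
    ORDER BY p.id DESC
""").fetchall()

print(rows)
```

Each profile ID gets the name from its latest charac row, and an ID with no charac rows still appears with a missing name because both joins are left joins.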
I'm doing a count from table1 whose records/rows don't exist in table2
Here is the query:
select count(1) from table1
where not exists (select 1 from table2 where
table1.col1 = table2.col1
and table2.id=1)
I need to see the records that are missing in table2 (among the rows where table2.id = 1) but are available in table1. The PK here is col1.
The query returns 0. But if I export both tables to Excel and compare them there, I find 1591 records that are missing from table1 and are available in table2.
Your query is working fine.
Your query finds records that exist in table1 but not in table2.
With Excel, you found records that do not exist in table1 but do exist in table2.
If you'd like to find these records with SQL, then your query should be:
select count(1) from table2
where table2.id=1 and table2.col1 not in (select col1 from table1)
or, with the NOT EXISTS version of this query:
select count(1) from table2
where table2.id=1 and
not exists (select 1 from table1 where table1.col1=table2.col1)
I didn't test the queries.
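To see why the direction of the check matters, here is a small sketch that runs the original query and both rewritten queries against an in-memory SQLite database from Python. The sample rows are made up for illustration, and SQLite is only a stand-in for whatever database is actually in use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (col1 INTEGER PRIMARY KEY);
    CREATE TABLE table2 (col1 INTEGER, id INTEGER);
    -- Hypothetical rows: every table1 key is in table2 with id = 1,
    -- but table2 has two extra keys (4 and 5) with id = 1.
    INSERT INTO table1 VALUES (1), (2), (3);
    INSERT INTO table2 VALUES (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 2);
""")

def count(sql):
    """Run a single-value COUNT query and return the number."""
    return conn.execute(sql).fetchone()[0]

# Original direction: rows in table1 with no match in table2 (id = 1).
in_t1_not_t2 = count("""
    SELECT COUNT(1) FROM table1
    WHERE NOT EXISTS (SELECT 1 FROM table2
                      WHERE table1.col1 = table2.col1 AND table2.id = 1)
""")

# Reversed direction with NOT IN: rows in table2 (id = 1) missing from table1.
in_t2_not_t1_not_in = count("""
    SELECT COUNT(1) FROM table2
    WHERE table2.id = 1 AND table2.col1 NOT IN (SELECT col1 FROM table1)
""")

# Reversed direction with NOT EXISTS.
in_t2_not_t1_not_exists = count("""
    SELECT COUNT(1) FROM table2
    WHERE table2.id = 1
      AND NOT EXISTS (SELECT 1 FROM table1 WHERE table1.col1 = table2.col1)
""")

print(in_t1_not_t2, in_t2_not_t1_not_in, in_t2_not_t1_not_exists)
```

With this data the original direction counts 0 rows while both reversed queries count 2. Since col1 is the primary key of table1 and so cannot be NULL, the NOT IN and NOT EXISTS versions agree here.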