Debug SparkSQL Query - amazon-web-services

What are some ways I can debug a Spark SQL query?
I have defined a DataFrame with a Spark SQL query and I have included show(1), but the query still runs for a very long time. Could anyone provide pointers? Thank you!
list_of_id_df = spark.sql("""
select t1.id
from table1 t1
join table2 t2 on t1.id = t2.cd
where t1.product = 'I'
and not exists
(select * from table3 t3
where t3.id = t1.id
and t3.year = t1.year
and t3.month = t1.month
and t3.day = t1.day
and t3.code in ('4321','5604'))
and not exists
(select MAX(status) from table4 t4
where t4.id = t1.id
and t4.year = t1.year
and t4.month = t1.month
and t4.day = t1.day
having max(status) > 3)
""")
list_of_id_df.createOrReplaceTempView("list_of_id")
list_of_id_df.show(1)
print("Done")

Related

Redshift Error when executing the delete script with EXISTS function. The Select runs fine for this query

This Redshift query fails -
DELETE FROM TBL_1 stg
WHERE EXISTS (
WITH CCDA as (
SELECT
row_number() OVER (PARTITION BY emp_id,customer_id ORDER BY seq_num desc) rn
, *
FROM TBL_2
WHERE end_dt > (SELECT max(end_dt) FROM TBL_3)
)
SELECT emp_id,customer_id FROM CCDA WHERE rn = 1
AND stg.emp_id = CCDA.emp_id
AND stg.customer_id = CCDA.customer_id
);
Error: Invalid operation: syntax error at or near "stg"
However, the below query runs fine -
SELECT * FROM TBL_1 stg
WHERE EXISTS (
WITH CCDA as (
SELECT
row_number() OVER (PARTITION BY emp_id,customer_id ORDER BY seq_num desc) rn
, *
FROM TBL_2
WHERE end_dt > (SELECT max(end_dt) FROM TBL_3)
)
SELECT emp_id,customer_id FROM CCDA WHERE rn = 1
AND stg.emp_id = CCDA.emp_id
AND stg.customer_id = CCDA.customer_id
);
Am I missing something?
You cannot use an alias for the target table in a DELETE statement. "stg" is such an alias, which is why you are getting this error.
Also, to reference other tables in a DELETE statement you need to use the USING clause.
See: https://docs.aws.amazon.com/redshift/latest/dg/r_DELETE.html
A quick stab at what this would look like (untested):
WITH CCDA as (
SELECT
row_number() OVER (PARTITION BY emp_id,customer_id ORDER BY seq_num desc) rn
, *
FROM TBL_2
WHERE end_dt > (SELECT max(end_dt) FROM TBL_3)
)
DELETE FROM TBL_1
USING CCDA
WHERE CCDA.rn = 1
AND TBL_1.emp_id = CCDA.emp_id
AND TBL_1.customer_id = CCDA.customer_id
;
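Since the rewrite above is untested, the intended effect (which TBL_1 rows should disappear) can be checked locally on made-up data. The following is a minimal sketch, assuming SQLite through Python's sqlite3 module as a stand-in for Redshift. SQLite has no DELETE ... USING, so the sketch keeps the EXISTS form and refers to the target table by its full name instead of an alias. The sample rows and the helper names build_sample_db, delete_matching_rows and remaining_rows are hypothetical, and window functions need SQLite 3.25 or newer.

```python
import sqlite3


def build_sample_db():
    """Create an in-memory database with made-up rows for tbl_1..tbl_3."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE tbl_1 (emp_id INTEGER, customer_id INTEGER);
        CREATE TABLE tbl_2 (emp_id INTEGER, customer_id INTEGER,
                            seq_num INTEGER, end_dt TEXT);
        CREATE TABLE tbl_3 (end_dt TEXT);
        INSERT INTO tbl_1 VALUES (1, 10), (2, 20), (3, 30);
        INSERT INTO tbl_2 VALUES (1, 10, 1, '2021-01-01'),
                                 (1, 10, 2, '2021-06-01'),
                                 (2, 20, 1, '2019-01-01');
        INSERT INTO tbl_3 VALUES ('2020-01-01');
    """)
    return conn


def delete_matching_rows(conn):
    """Delete tbl_1 rows that have a match in the filtered, ranked tbl_2 rows.

    The target table is referenced as tbl_1 (no alias) inside the subquery.
    """
    conn.execute("""
        DELETE FROM tbl_1
        WHERE EXISTS (
            WITH ccda AS (
                SELECT row_number() OVER (
                           PARTITION BY emp_id, customer_id
                           ORDER BY seq_num DESC) AS rn,
                       emp_id, customer_id
                FROM tbl_2
                WHERE end_dt > (SELECT max(end_dt) FROM tbl_3)
            )
            SELECT 1 FROM ccda
            WHERE ccda.rn = 1
              AND tbl_1.emp_id = ccda.emp_id
              AND tbl_1.customer_id = ccda.customer_id
        )
    """)


def remaining_rows(conn):
    """Return what is left in tbl_1, ordered for easy comparison."""
    return conn.execute(
        "SELECT emp_id, customer_id FROM tbl_1 ORDER BY emp_id"
    ).fetchall()


if __name__ == "__main__":
    db = build_sample_db()
    delete_matching_rows(db)
    print(remaining_rows(db))
```

Only the (1, 10) pair has a TBL_2 row newer than the latest TBL_3 end_dt in this sample, so it is the only row removed. The same before/after comparison can be run against the Redshift version once it is tested there.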

How do I get the table name and column name using a sequence name in Postgres?

I know the sequence name, but I don't know which table and column it belongs to.
How can I find them?
A slight improvement on @a_horse_with_no_name's answer. The query below also returns the schema name and the sequence name, so you get the schema, table, column and sequence names in a single result:
select ts.nspname as object_schema,
tbl.relname as table_name,
col.attname as column_name,
s.relname as sequence_name
from pg_class s
join pg_namespace sn on sn.oid = s.relnamespace
join pg_depend d on d.refobjid = s.oid and d.refclassid='pg_class'::regclass
join pg_attrdef ad on ad.oid = d.objid and d.classid = 'pg_attrdef'::regclass
join pg_attribute col on col.attrelid = ad.adrelid and col.attnum = ad.adnum
join pg_class tbl on tbl.oid = ad.adrelid
join pg_namespace ts on ts.oid = tbl.relnamespace
where s.relkind = 'S'
-- and s.relname = 'your_sequence_name_here'
and d.deptype in ('a', 'n');
Assuming it's a sequence that is owned by a column (e.g. because the column is defined as serial or identity), you can get that information by looking at pg_depend and joining it to pg_class (and others):
select tbl.relname as table_name,
col.attname as column_name
from pg_class s
join pg_namespace sn on sn.oid = s.relnamespace
join pg_depend d on d.refobjid = s.oid and d.refclassid='pg_class'::regclass
join pg_attrdef ad on ad.oid = d.objid and d.classid = 'pg_attrdef'::regclass
join pg_attribute col on col.attrelid = ad.adrelid and col.attnum = ad.adnum
join pg_class tbl on tbl.oid = ad.adrelid
join pg_namespace ts on ts.oid = tbl.relnamespace
where s.relkind = 'S'
and s.relname = 'your_sequence_name_here'
and d.deptype in ('a', 'n');
Another option is to look at the default value of the columns and check if it contains the sequence name:
select tbl.relname as table_name,
col.attname as column_name
from pg_attrdef ad
join pg_attribute col on col.attrelid = ad.adrelid and col.attnum = ad.adnum
join pg_class tbl on tbl.oid = ad.adrelid
where pg_get_expr(ad.adbin, ad.adrelid) like '%your_sequence_name_here%'
This would also work for sequences that are not owned by a column.

Extraneous Django DB Queries on user.has_perms(perms)

Looking at the SQL queries run when user.has_perms(perms) is called, I see:
SELECT "auth_permission"."id",
"auth_permission"."name",
"auth_permission"."content_type_id",
"auth_permission"."codename",
"django_content_type"."id",
"django_content_type"."name",
"django_content_type"."app_label",
"django_content_type"."model"
FROM "auth_permission"
inner join "auth_user_user_permissions"
ON ( "auth_permission"."id" =
"auth_user_user_permissions"."permission_id" )
inner join "django_content_type"
ON ( "auth_permission"."content_type_id" =
"django_content_type"."id" )
WHERE "auth_user_user_permissions"."user_id" = %s
ORDER BY "django_content_type"."app_label" ASC,
"django_content_type"."model" ASC,
"auth_permission"."codename" ASC
and:
SELECT "django_content_type"."app_label",
"auth_permission"."codename"
FROM "auth_permission"
inner join "auth_group_permissions"
ON ( "auth_permission"."id" =
"auth_group_permissions"."permission_id" )
inner join "auth_group"
ON ( "auth_group_permissions"."group_id" = "auth_group"."id" )
inner join "auth_user_groups"
ON ( "auth_group"."id" = "auth_user_groups"."group_id" )
left outer join "django_content_type"
ON ( "auth_permission"."content_type_id" =
"django_content_type"."id" )
WHERE "auth_user_groups"."user_id" = %s
My questions are:
What exactly are these queries doing?
Why are they run on every request?
Is there some way to cache these results?

From Select in doctrine 2

How do I do this in the Doctrine 2 QueryBuilder or in DQL?
SELECT * FROM
(
select * from my_table order by timestamp desc
) as my_table_tmp
group by catid
order by nid desc
I think your query is the same as:
SELECT *
FROM my_table
GROUP BY catid
HAVING timestamp = MAX(timestamp)
ORDER BY nid DESC
;
If it is correct, then you should be able to do:
$qb->select('e')
->from('My\Entities\Table', 'e')
->groupBy('e.catid')
->having('e.timestamp = MAX(e.timestamp)')
->orderBy('e.nid', 'DESC')
;
Or, directly using DQL:
SELECT e
FROM My\Entities\Table e
GROUP BY e.catid
HAVING e.timestamp = MAX(e.timestamp)
ORDER BY e.nid DESC
;
Hope this helps and works! ;)
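Whether a rewrite like the one above returns the intended rows is easiest to judge on a small dataset. The following is a minimal sketch, assuming the goal of the original query is the latest row per catid. It does not use the HAVING form from the answer. Instead it swaps in a different, portable formulation: a join against the per-group MAX(timestamp), run on SQLite through Python's sqlite3 module. The sample rows and the helper names build_sample_db and latest_rows are hypothetical.

```python
import sqlite3

# Latest row per catid: join each row to its group's maximum timestamp.
LATEST_PER_CATEGORY = """
    SELECT t.nid, t.catid, t.timestamp
    FROM my_table AS t
    JOIN (
        SELECT catid, MAX(timestamp) AS max_ts
        FROM my_table
        GROUP BY catid
    ) AS latest
      ON latest.catid = t.catid
     AND latest.max_ts = t.timestamp
    ORDER BY t.nid DESC
"""


def build_sample_db():
    """Create an in-memory table with made-up rows (nid, catid, timestamp)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE my_table (nid INTEGER, catid INTEGER, timestamp INTEGER)"
    )
    conn.executemany(
        "INSERT INTO my_table VALUES (?, ?, ?)",
        [(1, 1, 100), (2, 1, 200), (3, 2, 150), (4, 2, 120)],
    )
    return conn


def latest_rows(conn):
    """Return one row per catid (the newest), ordered by nid descending."""
    return conn.execute(LATEST_PER_CATEGORY).fetchall()


if __name__ == "__main__":
    print(latest_rows(build_sample_db()))
```

Running the candidate DQL against the same rows and comparing the output with this result is a quick way to confirm or reject the assumed equivalence.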

Need help on Regular expression to extract sql sub query

I am new to regex. I want to extract the subquery from a given query using a regular expression.
For example, I have a query like the following:
Select * from (
select * from Table_A where ID = 90
UNION
select * from Table_B where ID = 90
) as SUBQUERY left join TABL_ABC abc ON (abc.id = SUBQUERY.id)
Now I want my regular expression to match only the following lines:
select * from Table_A where ID = 90
UNION
select * from Table_B where ID = 90
Please help me. Thank you in advance.
If it is a simple subquery without additional parentheses, you can just use this regexp:
/\(\s*(select[^)]+)\)/i
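<?php
// Sample query; the subquery sits between the first "(" and ") as".
$sql = 'Select * from ( select * from Table_A where ID = 90 UNION select * from Table_B where ID = 90 ) as SUBQUERY left join TABL_ABC abc ON (abc.id = SUBQUERY.id)';
// The U modifier makes .+ ungreedy, so the match stops at the first ") as ".
if (preg_match('/\(\s*(select.+)\) as /iU', $sql, $matched)) {
    // $matched[1] holds the captured subquery text.
    $subquery = trim($matched[1]);
    var_dump($subquery);
}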
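The first pattern above, /\(\s*(select[^)]+)\)/i, can be tried on the multi-line query from the question. The following is a minimal sketch in Python using the standard re module; the helper name extract_subquery is hypothetical. Because [^)] also matches newlines, no extra flag is needed for multi-line input.

```python
import re

# Same pattern as above: "(", optional whitespace, then "select" followed by
# everything up to the first ")".
SUBQUERY_RE = re.compile(r"\(\s*(select[^)]+)\)", re.IGNORECASE)


def extract_subquery(sql):
    """Return the first parenthesised SELECT in sql, or None if there is none."""
    match = SUBQUERY_RE.search(sql)
    return match.group(1).strip() if match else None


QUERY = """Select * from (
select * from Table_A where ID = 90
UNION
select * from Table_B where ID = 90
) as SUBQUERY left join TABL_ABC abc ON (abc.id = SUBQUERY.id)"""

if __name__ == "__main__":
    print(extract_subquery(QUERY))
```

As the answer notes, this only works when the subquery contains no parentheses of its own: the capture ends at the first ")", so a nested IN (...) list or function call would cut the result short.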