Dynamic task execution in Airflow

Suppose I have an Airflow DAG containing three tasks: t1, t2, t3. The flow of the DAG is
t1 >> t2 >> t3.
I want a way to change the starting task dynamically. If I pass t1 at runtime, the DAG should be triggered from t1. If I pass t2, then t1 should be skipped and execution should start from t2. Is there a way to do this in Airflow?

There is a way, though not exactly as you describe it.
Add a t0 task as a BranchPythonOperator that decides whether to continue to t1 or to skip it and go straight to t2:
from airflow.operators.dummy import DummyOperator
from airflow.operators.python import BranchPythonOperator

def choose(ti):
    if something:  # whatever condition decides the starting task
        return 't1'
    return 't2'

t0 = BranchPythonOperator(
    task_id='t0',
    python_callable=choose)
t1 = DummyOperator(task_id='t1')
t2 = DummyOperator(task_id='t2', trigger_rule='one_success')
t3 = DummyOperator(task_id='t3')

t0 >> [t1, t2] >> t3
t1 >> t2
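
If "give t1 in runtime" means choosing the branch when you trigger the DAG, one way is to read the choice from dag_run.conf. This is a minimal sketch, not part of the original answer; the DAG id and the start_task key are names assumed for illustration:

# Trigger with, e.g.:
#   airflow dags trigger my_dag --conf '{"start_task": "t2"}'
# "start_task" is a hypothetical key chosen for this sketch.
def choose(dag_run, **context):
    return dag_run.conf.get('start_task', 't1')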

Related

Efficiently creating a dictionary in pyspark using lists derived from df columns

I created this function in Python using a pandas dataframe, and I'd like to use it in Spark as well.
What I'm doing with this function is:
converting the df column to a list (t1)
converting the unique values of the column to a list (t2)
creating a list for each unique value of each feature (t); this list takes the value 1 at the positions where the unique value appears in t1, and 0 otherwise.
At the end, the result is a dictionary with the unique values of each feature as keys and, as the value for each key, a list holding 1 where the key (the unique value) appears and 0 otherwise.
feat_list is just a list of all the column names.
def binary_dict(pandas_df, feat_list):
    dict_feature = dict()
    for col in feat_list:
        t1 = pandas_df[col].tolist()
        t2 = pandas_df[col].unique().tolist()
        for value in t2:
            t = []
            for i in range(0, len(t1)):
                if value == t1[i]:
                    t.append(1)
                else:
                    t.append(0)
            cc = str(col)
            vv = "_" + str(value)
            cv = cc + vv
            dict_feature[cv] = t
    return dict_feature
I tried using
t1 = df.select("col_name").rdd.flatMap(list).collect()
to create t1, but it took over 20 minutes to build the list for a single column, and I have about 100 columns. Is there a way to convert this function to Spark efficiently?
Thanks everyone for the answers!
PS: I'm using Azure Synapse Analytics, Python 3.8 and PySpark 3.1.
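
One possible direction (an untested sketch, not an answer from the original thread; it assumes each column's distinct-value set is small enough to collect) is to express every 0/1 indicator as a Spark column expression and collect the dataframe once, instead of collecting each column separately:

from pyspark.sql import functions as F

def binary_dict_spark(df, feat_list):
    # Build one 0/1 indicator expression per (column, distinct value) pair.
    indicators = []
    for col in feat_list:
        values = [r[0] for r in df.select(col).distinct().collect()]
        for value in values:
            name = str(col) + "_" + str(value)
            indicators.append(
                F.when(F.col(col) == value, 1).otherwise(0).alias(name)
            )
    wide = df.select(*indicators)
    # A single collect for the whole frame, instead of one per column.
    rows = wide.collect()
    return {c: [row[c] for row in rows] for c in wide.columns}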

Django queryset: how to aggregate (ArrayAgg) over a queryset with union?

from django.contrib.postgres.aggregates import ArrayAgg

t1 = Table1.objects.values('id')
t2 = Table2.objects.values('id')
t3 = Table3.objects.values('id')
t = t1.union(t2, t3)
t.aggregate(id1=ArrayAgg('id'))

This raises an error:
ProgrammingError: column "__col1" does not exist
Equivalent raw SQL
SELECT array_agg(a.id) from
(
SELECT id FROM table1
UNION
SELECT id FROM table2
UNION
SELECT id FROM table3
) as a
The generated query looks for a column named __col1, which doesn't exist in the subquery. So explicitly name the columns with F() objects and then aggregate. (Note that Django officially supports only a limited set of operations on combined querysets, so this workaround leans on the shape of the generated SQL.)

from django.db.models import F

t1 = Table1.objects.values(__col1=F('id'))
t2 = Table2.objects.values(__col1=F('id'))
t3 = Table3.objects.values(__col1=F('id'))
t = t1.union(t2, t3)
t.aggregate(id1=ArrayAgg('id'))
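
To sanity-check the workaround before aggregating, you can inspect the SQL Django compiles for the combined queryset; str(qs.query) is standard Django, the rest is just an illustrative snippet:

# Inspect the compiled UNION to confirm the subquery columns are
# now aliased __col1, matching what the outer aggregate references.
t = t1.union(t2, t3)
print(t.query)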

Django: weird behavior of foreign key lookup and annotate F expression

class Sentence(Model):
    name = CharField()

class Token(Model):
    name = CharField()
    sentence = ForeignKey(Sentence, related_name='tokens')

(Sentence.objects
    .annotate(n=Count('tokens', distinct=True)).filter(n=5)
    .filter(tokens__name__in=['se']).annotate(n0=F('tokens'))
    .filter(tokens__name__in=['faire']).annotate(n1=F('tokens'))
    .filter(tokens__name__in=['faire']).annotate(n2=F('tokens'))
    .filter(tokens__name__in=['un']).annotate(n3=F('tokens'))
    .filter(tokens__name__in=['avoir']).annotate(n4=F('tokens')))
The above code generates the following query:

SELECT "sentence"."id", "sentence"."name",
       COUNT(DISTINCT "token"."id") AS "n",
       T3."id" AS "n0", T4."id" AS "n1", T4."id" AS "n2",
       T6."id" AS "n3", T6."id" AS "n4"
FROM "sentence"
LEFT OUTER JOIN "token" ON ("sentence"."id" = "token"."sentence_id")
INNER JOIN "token" T3 ON ("sentence"."id" = T3."sentence_id")
INNER JOIN "token" T4 ON ("sentence"."id" = T4."sentence_id")
INNER JOIN "token" T5 ON ("sentence"."id" = T5."sentence_id")
INNER JOIN "token" T6 ON ("sentence"."id" = T6."sentence_id")
INNER JOIN "token" T7 ON ("sentence"."id" = T7."sentence_id")
WHERE (T3."name" IN (se) AND T4."name" IN (faire) AND T5."name" IN (un)
       AND T6."name" IN (avoir) AND T7."name" IN (faire))
GROUP BY "sentence"."id", T3."id", T4."id", T6."id"
HAVING COUNT(DISTINCT "token"."id") = 5
Why is the numbering so strange (it starts at T3)? More importantly, why is n2 assigned to T4 rather than T5? The same happens with n4 and T6; the aliases the annotations pick up seem to advance by two.
What I want to accomplish is to capture a token id at each step of the inner joins. It works when there is one join, but after that it breaks.
Any suggestions?
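
One possible way to pin each annotation to its own join (a sketch, not an answer from the original thread) is FilteredRelation, which creates a separately aliased join per condition, so an annotation cannot silently land on a neighbouring filter's alias:

from django.db.models import Count, F, FilteredRelation, Q

qs = (
    Sentence.objects
    .annotate(n=Count('tokens', distinct=True)).filter(n=5)
    # Each FilteredRelation gets its own join alias in the SQL.
    .annotate(tok_se=FilteredRelation('tokens',
                                      condition=Q(tokens__name__in=['se'])))
    .annotate(n0=F('tok_se__id'))
    .annotate(tok_faire=FilteredRelation('tokens',
                                         condition=Q(tokens__name__in=['faire'])))
    .annotate(n1=F('tok_faire__id'))
    # ...and so on, one explicitly named alias per step.
)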

Redshift: insert column C1 of table T1 into column C2 of table T2

I have two tables:
T1 with columns A1, A2, A3, A4,...., A20.
T2 with columns B1, B2, B3,...., B15.
The data type of all columns is varchar.
I want to copy all values of the column range A1-A10 into B1-B10. How do I do this in Redshift? I tried:
insert into T2(B1,B2,...,B10) select A1 A2 A3 ... A10 from T1
but it failed. I have already corrected errors such as a missing ) and a stray . (dot) in a column name.
How can I insert selected columns from one table into another? Is there another way to do this?
You need commas between the column names: insert into T2 (select A1, A2, ... A10 from T1). And since T2 has more columns (15) than you are inserting, keep the target column list as well: insert into T2(B1, ..., B10) select A1, ..., A10 from T1.
I tested with the following queries and things worked fine for me:
create temp table T1 (a varchar(5), b varchar(5), c varchar(5), d varchar(5), e varchar(5));
insert into T1 values ('t11', 't12', 't13', 't14', 't15');
create temp table T2 (a varchar(5), b varchar(5), c varchar(5));
insert into T2 values ('t21', 't22', 't23');
insert into T2 (select a, b, c from T1);
select * from T2;
The last line correctly printed the following:
t21 t22 t23
t11 t12 t13

What is wrong in this SAS AML code?

proc sql noprint;
CREATE TABLE WORK.TRANS_SENT_TO_USA AS
SELECT DISTINCT T3.role_desc , T1.ACCOUNT_KEY , T1.TRANSACTION_KEY ,
T2.MOnth_key ,
T1.DATE_KEY as DAY ,
T1.CURRENCY_AMOUNT_IN_ACCOUNT_CCY as AMOUNT ,
T5.ACCOUNT_CURRENCY_CODE as currency ,
T6.FULL_NAME,
MAX(T4.BANK_NAME) as Ben_Bank
FROM DB_CORE.FSC_CASH_FLOW_FACT T1
INNER JOIN DB_CORE.FSC_DATE_DIM T2
ON T1.DATE_KEY = T2.DATE_KEY
INNER JOIN DB_CORE.FSC_CASH_FLOW_BANK_BRIDGE T3
ON T1.TRANSACTION_KEY = T3.TRANSACTION_KEY
INNER JOIN DB_CORE.FSC_BANK_DIM T4
ON T3.BANK_KEY = T4.BANK_KEY
INNER JOIN DB_CORE.FSC_ACCOUNT_DIM T5
ON T1.ACCOUNT_KEY = T5.ACCOUNT_KEY
INNER JOIN DB_CORE.FSC_EXT_PARTY_ACCOUNT_DIM T6
ON T1.BENEFICIARY_EXT_PARTY_KEY = T6.EXT_PARTY_ACCOUNT_KEY
WHERE T2.CALENDAR_DATE >= "&LAST_RUN_DATE"D
AND T3.ROLE_DESC like '%BENEFICIARY%'
AND T4.BANK_COUNTRY_CODE LIKE 'US%'
Group by T3.role_desc ,T1.ACCOUNT_KEY , T1.TRANSACTION_KEY ,
T2.MOnth_key ,
T1.DATE_KEY ,
T1.CURRENCY_AMOUNT_IN_ACCOUNT_CCY ,
T5.ACCOUNT_CURRENCY_CODE ;
RUN;
FULL_NAME appears in the SELECT but not in the GROUP BY. In PROC SQL that makes SAS remerge the summary statistics back with the detail rows (you'll see a note about remerging in the log), which is usually not what you want: either add T6.FULL_NAME to the GROUP BY or wrap it in an aggregate such as MAX(). A smaller point: PROC SQL steps end with QUIT;, not RUN;. Beyond that, you didn't specifically describe your problem(s), so it's hard to say what else might be wrong.