Airflow task with trigger_rule none_failed doesn't get skipped correctly

At the end of my workflow I'm trying to implement two parallel tasks with different trigger rules to log whether the execution of certain tasks was successful or not. For this I've created two tasks with the trigger rules 'one_failed' and 'none_failed'. The one_failed task gets skipped when no previous task has failed. But when one of the parent tasks fails, the none_failed task isn't skipped; instead it gets the status upstream_failed, even though this shouldn't be the case if I understand the documentation correctly.
Visualisations of the DAG runs:
task with the one_failed trigger rule correctly being skipped
task with the none_failed trigger rule getting the status upstream_failed instead of skipped
Furthermore, if I look at the code behind the trigger rule dependencies, it seems to be implemented to behave like this:
if flag_upstream_failed:
    if tr == TR.ALL_SUCCESS:
        if upstream_failed or failed:
            ti.set_state(State.UPSTREAM_FAILED, session)
        elif skipped:
            ti.set_state(State.SKIPPED, session)
    elif tr == TR.ALL_FAILED:
        if successes or skipped:
            ti.set_state(State.SKIPPED, session)
    elif tr == TR.ONE_SUCCESS:
        if upstream_done and not successes:
            ti.set_state(State.SKIPPED, session)
    elif tr == TR.ONE_FAILED:
        if upstream_done and not (failed or upstream_failed):
            ti.set_state(State.SKIPPED, session)
    elif tr == TR.NONE_FAILED:
        if upstream_failed or failed:
            ti.set_state(State.UPSTREAM_FAILED, session)
        elif skipped == upstream:
            ti.set_state(State.SKIPPED, session)
But this doesn't seem to mesh with the way the Airflow documentation describes the none_failed rule.
These are the operators and dependencies of the test DAG that I'm using. Sorry if it seems unnecessarily complex; I was trying out some other things as well.
firstTask = BashOperator(
    task_id='first',
    bash_command='echo first'
)
secondTask = BashOperator(
    task_id='second',
    bash_command='echo second'
)
thirdTask = BashOperator(
    task_id='third',
    bash_command='echo third'
)
fifthTask = BashOperator(
    task_id='fifth',
    bash_command='echog fifth'  # 'echog' is not a valid command, so this task fails
)
sixthTask = BashOperator(
    task_id='sixth',
    bash_command='echo sixth'
)
seventhTask = BashOperator(
    task_id='seventh',
    trigger_rule='one_failed',
    bash_command='echo seventh'
)
eightTask = BashOperator(
    task_id='eight',
    trigger_rule='none_failed',
    bash_command='echo eight'
)
firstDummy = DummyOperator(
    task_id='firstDummy'
)
secondDummy = DummyOperator(
    task_id='secondDummy'
)
thirdDummy = DummyOperator(
    task_id='thirdDummy'
)

firstTask >> secondTask
firstTask >> thirdTask
secondTask >> firstDummy
thirdTask >> firstDummy
fifthTask >> secondDummy
sixthTask >> secondDummy
firstDummy >> seventhTask
secondDummy >> seventhTask
firstDummy >> eightTask
secondDummy >> eightTask
eightTask >> thirdDummy
seventhTask >> thirdDummy
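(For reference, a minimal harness around this snippet so it runs as a standalone DAG file might look like the sketch below; the import paths assume an Airflow 1.10-style installation, and the DAG id, start date and schedule interval are placeholders rather than values from the original question.)

# Sketch of a wrapper, assuming Airflow 1.10-style import paths; the DAG id,
# start_date and schedule_interval below are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

with DAG(dag_id='trigger_rule_test',
         start_date=datetime(2021, 1, 1),
         schedule_interval=None) as dag:
    firstTask = BashOperator(task_id='first', bash_command='echo first')
    # ... define the remaining operators and dependencies exactly as above ...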

Related

AWS Redshift Stored Procedure Abortions

I've got two stored procedures which run every 4 hours. I'm experiencing odd behaviour where both of these procedures get aborted exactly the same number of times as they run successfully (see table below). I'm using pg_catalog.svl_stored_proc_call to get the procedure run status.
When I look at pg_catalog.stl_load_errors I can't see any errors for the runs.
What's the best way to investigate this behaviour?
Code for datawarehouse.p_add_missing_tbls()
DECLARE
row RECORD;
row2 RECORD;
BEGIN
FOR row IN select * from
(
select distinct
concat(concat(lower(regexp_replace(d.service,'-','_')), '_'),
lower(regexp_replace(regexp_replace(d.entity_name,'::',''),'(.)([A-Z]+)','$1_$2'))) as "tbl_name",
concat(concat(concat(concat(lower(regexp_replace(d.service,'-','_')),'_'),lower(regexp_replace(regexp_replace(d.entity_name,'::',''),'(.)([A-Z]+)','$1_$2'))),'_'), d."key") as "key" ,
d.value_type,
case d.value_type
when '0' then 'varchar(32768)'
when '1' then 'numeric(20,10)'
when '2' then 'int(2)'
when '3' then 'timestamp'
when '4' then 'varchar(32768)'
end as "data_type"
from
datawarehouse.definitions d
where
d."key" not in ('id')
)
loop
--List of keys
-- select into row2 '\'' || listagg(distinct "key",'\',\'') || '\'' as keys from datawarehouse.definitions where concat(concat(lower(service),'_'),lower(entity_name)) = row.tbl_name;
-- select into row2 '\'' || listagg(distinct "key",'\',\'') || '\'' as keys from datawarehouse.definitions where concat(concat(lower(definitions.service),'_'),lower(definitions.entity_name)) = row.tbl_name;
select into row2 '\'' || listagg(distinct (concat(concat(concat(concat(lower(regexp_replace(service,'-','_')),'_'),lower(regexp_replace(regexp_replace(entity_name,'::',''),'(.)([A-Z]+)','$1_$2'))),'_'), "key")),'\',\'') || '\'' as keys from datawarehouse.definitions where concat(concat(lower(regexp_replace(definitions.service,'-','_')),'_'),lower(regexp_replace(regexp_replace(definitions.entity_name,'::',''),'(.)([A-Z]+)','$1_$2'))) = row.tbl_name and definitions."key" not in ('id');
--Delete staging tbl
execute 'drop table if exists staging.staging_'||row.tbl_name||';';
--Create staging tbl
EXECUTE 'create table staging.staging_'||row.tbl_name||' AS
select *
from (select
attribute_values.entity_id as '||row.tbl_name||'_id,
-- definitions."key",
concat(concat(concat(concat(lower(regexp_replace(definitions.service,''-'',''_'')),''_''),lower(regexp_replace(regexp_replace(definitions.entity_name,''::'',''''),''(.)([A-Z]+)'',''$1_$2''))),''_''), definitions."key")::varchar as "key",
concat(concat(lower(regexp_replace(definitions.service,''-'',''_'')),''_''),lower(regexp_replace(regexp_replace(definitions.entity_name,''::'',''''),''(.)([A-Z]+)'',''$1_$2''))) as "tbl_name",
attribute_values.updated_at as "updated_at",
attribute_values.destroyed_upstream as "deleted_upstream",
case definitions.value_type
when ''0'' then attribute_values.string_value::varchar
when ''1'' then attribute_values.number_value::varchar
when ''2'' then attribute_values.boolean_value::varchar
when ''3'' then attribute_values.datetime_value::varchar
when ''4'' then attribute_values.array_value::varchar
end as "final_value"
from
datawarehouse.attribute_values
left join
datawarehouse.definitions
on
definitions.id = attribute_values.definition_id
where
attribute_values.updated_at>= coalesce(((select max(updated_at) from datawarehouse.'|| row.tbl_name || ' )), (select min(av.updated_at) from datawarehouse.attribute_values av ))
and
tbl_name='''||row.tbl_name||'''
and
definitions."key" not in (''id'')
order by
entity_id desc)
PIVOT (max(final_value) for "key" in ( ' || row2.keys || ' )
);';
-- Drop staging.staging_col_info_v
-- execute 'drop view staging.staging_col_info_v;';
-- Create view
execute ' create or replace view staging.staging_col_info_v AS
with staging_tbl_info as (
select
d.table_schema ,
d.table_name ,
d.column_name as "column_name"
from
pg_catalog.svv_columns d
where
d.table_schema = ''staging''
and
d.table_name like ''staging_%''
),
tbl_info as (
select
col_name,
data_type
from
datawarehouse.tbl_col_v
)
select
*
from
staging_tbl_info s
inner join
tbl_info h
on
h.col_name = s.column_name;';
END LOOP;
RETURN;

How to toggle condition in where clause

How can I implement the logic below in a WHERE clause?
Below is the scenario:
IF @include_prime = 'YES'
then the result set should contain Promo_records along with non_prime_records.
IF @include_prime = 'NO'
then the result set should contain only non_prime_records.
I have tried the query below, but for both the 'Yes' and 'No' selections I get the same number of records.
DECLARE
    @weeks INT = 2,
    @week_ending DATE = '12/03/2022',
    @include_prime CHAR(3) = 'YES'
BEGIN
    SELECT
        [PRType],
        [PRName],
        [PRNumber],
        [PROffice],
        [PRStudio],
        [Manager],
        [WeekEndDate]
    FROM Source_table
    WHERE
        Source_table.[WeekEndDate] BETWEEN @week_ending AND DATEADD("WEEK", CAST(@weeks AS INT), @week_ending)
        AND
        (
            (   @include_prime = 'Yes'
                OR Source_table.[PRType] = 'Prime'
            )
            OR
            (   @include_prime = 'No'
                AND Source_table.[PRType] <> 'Prime'
            )
        )
END

job aborted while writing parquet to s3 via glue jobs

My code, which consists of the transformations below, looks like this:
dictionaryDf = spark.read.option("header", "true").csv(
"s3://...../.csv")
web_notif_data = fullLoad.cache()
web_notif_data.persist(StorageLevel.MEMORY_AND_DISK)
print("::::::data has been loaded::::::::::::")
distinct_campaign_name = web_notif_data.select(
trim(web_notif_data.campaign_name).alias("campaign_name")).distinct()
web_notif_data.createOrReplaceTempView("temp")
variablesList = Config.get('web', 'variablesListWeb')
web_notif_data = spark.sql(variablesList)
web_notif_data.persist(StorageLevel.MEMORY_AND_DISK)
web_notif_data = web_notif_data.withColumn("camp", regexp_replace("campaign_name", "_", ""))
web_notif_data = web_notif_data.drop("campaign_name")
web_notif_data = web_notif_data.withColumnRenamed("camp", "campaign_name")
web_notif_data = web_notif_data.withColumn("channel", lit("web_notification"))
web_notif_data.createOrReplaceTempView("data")
campaignTeamWeb = Config.get('web', 'campaignTeamWeb')
web_notif_data = spark.sql(campaignTeamWeb)
web_notif_data.persist(StorageLevel.MEMORY_AND_DISK)
distinct_campaign_name = distinct_campaign_name.withColumn("camp", F.regexp_replace(
F.lower(F.trim(col("campaign_name"))),
"[^a-zA-Z0-9]", ""))
output_df3 = (
distinct_campaign_name.withColumn("cname_split",
F.explode(F.split(F.lower(F.trim(col("campaign_name"))), "_")))
.join(
dictionaryDf,
(
(
(F.col("function") == "contains") &
F.col("camp").contains(F.col("terms"))
) |
(
(F.col("function") == "match") &
F.col("campaign_name").contains("_") &
(F.col("cname_split") == F.col("terms"))
)
),
"left"
)
.withColumn(
"empty_is_other",
F.when(
(
F.col("product").isNull() &
F.col("product_category").isNull()
),
"other"
)
)
.withColumn(
"rn",
F.row_number().over(
Window.partitionBy("campaign_name")
.orderBy(
F.when(
F.col("function").isNull(), 3
).when(
F.col("function") == "match", 2
).otherwise(1),
F.length(F.col("terms")).desc(),
F.col("product").isNull()
)
)
)
.filter("rn=1")
.select(
"campaign_name",
F.coalesce("product", "empty_is_other").alias("prod"),
F.coalesce("product_category", "empty_is_other").alias("prod_cat"),
)
.na.fill("")
)
print(":::::::::::transformations have been done finally::::::::::::")
web_notif_data1 = web_notif_data # Just taking the backup of DF in case something goes wrong
web_notif_data = web_notif_data.drop("campaign_name")
web_notif_data = web_notif_data.withColumnRenamed("temp_campaign_name", "campaign_name")
veryFinalDF = web_notif_data.join(output_df3, "campaign_name", "left_outer")
# veryFinalDF.show(truncate=False)
veryFinalDF.write.mode("overwrite").parquet(aggregatedPath)
print("::::final data have been written successfully::::::")
where fullLoad is the data frame that reads from the Redshift table. This code works fine on 0.2 million records. However, in production, 15 days of data could be a minimum of around 25 million records. I don't know the size in bytes, since the data is stored in the Redshift table and we read from it and then process it. I am running this code via Glue jobs and it gets stuck on the last line, i.e. while writing the data as parquet. It gives me the below error:
I tried running it with 30 executors. It takes around 20 minutes to load the data from Redshift into the fullLoad dataframe. What else can be done to avoid this error? I am new to AWS and Glue jobs.
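(A sketch of one mitigation that is commonly tried for this kind of failure, not a verified fix for this particular job: control the number of output partitions before the write so that no single task has to hold an oversized chunk of data. The partition count below is an arbitrary placeholder to tune against the real data volume and the memory available to the Glue executors.)

# Sketch only: spread the parquet write across more, smaller tasks.
# 200 is a placeholder partition count, not a recommendation.
veryFinalDF = veryFinalDF.repartition(200)

veryFinalDF.write.mode("overwrite").parquet(aggregatedPath)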

In the local environment, the result value and the dataflow result values are different

Here is my input data.
Input (local):
'Iot,c c++ python,2015',
'Web,java spring,2016',
'Iot,c c++ spring,2017',
'Iot,c c++ spring,2017',
This is the result of running apache-beam in a local environment.
Output (local):
Iot,2015,c,1
Iot,2015,c++,1
Iot,2015,python,1
Iot,2017,c,2
Iot,2017,c++,2
Iot,2017,spring,2
Web,2016,java,1
Web,2016,spring,1
However, when I run it on Google Cloud Dataflow and write the result to a bucket, the results are different.
Storage (bucket):
Web,2016,java,1
Web,2016,spring,1
Iot,2015,c,1
Iot,2015,c++,1
Iot,2015,python,1
Iot,2017,c,1
Iot,2017,c++,1
Iot,2017,spring,1
Iot,2017,c,1
Iot,2017,c++,1
Iot,2017,spring,1
Here is my code:
# apache_beam
from apache_beam.options.pipeline_options import PipelineOptions
import apache_beam as beam

pipeline_options = PipelineOptions(
    project='project-id',
    runner='dataflow',
    temp_location='bucket-location'
)

def pardo_dofn_methods(test=None):
    import apache_beam as beam

    class split_category_advanced(beam.DoFn):
        def __init__(self, delimiter=','):
            self.delimiter = delimiter
            self.k = 1
            self.pre_processing = []
            self.window = beam.window.GlobalWindow()
            self.year_dict = {}
            self.category_index = 0
            self.language_index = 1
            self.year_index = 2
            self.result = []

        def setup(self):
            print('setup')

        def start_bundle(self):
            print('start_bundle')

        def finish_bundle(self):
            print('finish_bundle')
            for ppc_index in range(len(self.pre_processing)):
                if self.category_index == 0 or self.category_index % 3 == 0:
                    if self.pre_processing[self.category_index] not in self.year_dict:
                        self.year_dict[self.pre_processing[self.category_index]] = {}
                if ppc_index + 2 == 2 or ppc_index + 2 == self.year_index:
                    # { category : { year : {} } }
                    if self.pre_processing[self.year_index] not in self.year_dict[self.pre_processing[self.category_index]]:
                        self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]] = {}
                    # { category : { year : c : { }, c++ : { }, java : { }}}
                    language = self.pre_processing[self.year_index - 1].split(' ')
                    for lang_index in range(len(language)):
                        if language[lang_index] not in self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]]:
                            self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]][language[lang_index]] = 1
                        else:
                            self.year_dict[self.pre_processing[self.category_index]][self.pre_processing[self.year_index]][language[lang_index]] += 1
                    self.year_index = self.year_index + 3
                self.category_index = self.category_index + 1

            csvFormat = ''
            for category, nested in self.year_dict.items():
                for year in nested:
                    for language in nested[year]:
                        csvFormat += (category + "," + str(year) + "," + language + "," + str(nested[year][language])) + "\n"
            print(csvFormat)
            yield beam.utils.windowed_value.WindowedValue(
                value=csvFormat,
                # value = self.pre_processing,
                timestamp=0,
                windows=[self.window],
            )

        def process(self, text):
            for word in text.split(self.delimiter):
                self.pre_processing.append(word)
            print(self.pre_processing)

    # with beam.Pipeline(options=pipeline_options) as pipeline:
    with beam.Pipeline() as pipeline:
        results = (
            pipeline
            | 'Gardening plants' >> beam.Create([
                'Iot,c c++ python,2015',
                'Web,java spring,2016',
                'Iot,c c++ spring,2017',
                'Iot,c c++ spring,2017',
            ])
            | 'Split category advanced' >> beam.ParDo(split_category_advanced(','))
            | 'Save' >> beam.io.textio.WriteToText("bucket-location")
            | beam.Map(print)
        )
    if test:
        return test(results)

if __name__ == '__main__':
    pardo_dofn_methods()
The code executes a simple word count.
The output CSV columns are [category, year, language, count], e.g. Iot, 2015, c, 1.
Thank you for reading it.
The most likely reason you are getting different output is parallelism. When using the DataflowRunner, operations are run in parallel as much as possible. Since you are using a ParDo to count, when the element Iot,c c++ spring,2017 goes to two different workers, the count doesn't happen the way you want (you are counting in the ParDo).
You need to use Combiners (4.2.4).
Here you have an easy example of what you want to do:
import apache_beam as beam

def generate_kvs(element, csv_delimiter=',', field_delimiter=' '):
    splitted = element.split(csv_delimiter)
    fields = splitted[1].split(field_delimiter)
    # final key to count is (Source, year, language)
    return [(f"{splitted[0]}, {splitted[2]}, {x}", 1) for x in fields]

p = beam.Pipeline()

elements = ['Iot,c c++ python,2015',
            'Web,java spring,2016',
            'Iot,c c++ spring,2017',
            'Iot,c c++ spring,2017']

(p | beam.Create(elements)
   | beam.ParDo(generate_kvs)
   | beam.combiners.Count.PerKey()
   | "Format" >> beam.Map(lambda x: f"{x[0]}, {x[1]}")
   | beam.Map(print))

p.run()
This would output the result you want no matter how the elements are distributed across workers.
Note that the idea of Apache Beam is to parallelise as much as possible and, in order to aggregate, you need Combiners.
I would recommend you check some wordcount examples so you get the hang of combiners.
EDIT
Clarification on Combiners:
ParDo is an operation that happens on an element-by-element basis. It takes one element, performs some operations and sends the output to the next PTransform. When you need to aggregate data (count elements, sum values, join sentences...), element-wise operations don't work; you need something that takes a PCollection (i.e., many elements with a logic) and outputs something. This is where combiners come in: they perform operations on a PCollection basis, which can be spread across workers (part of the Map-Reduce model).
In your example, you were using a class attribute to store the count in the ParDo, so when an element went through it, it changed the attribute within the class. This works when all elements go through the same worker, since the class is instantiated on a per-worker basis (i.e., workers don't share state), but when there are more workers, the count (with the ParDo) happens in each worker separately.
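(As a toy illustration of that difference, not part of the original answer: the per-key aggregation below is done by a combiner, so the result is the same regardless of which worker saw which element; the key strings are made up for the example.)

import apache_beam as beam

# The sum per key is computed across the whole PCollection, so the worker
# placement of individual elements does not change the result.
with beam.Pipeline() as p:
    (p
     | beam.Create([('Iot, 2017, c', 1), ('Iot, 2017, c', 1), ('Web, 2016, java', 1)])
     | beam.CombinePerKey(sum)
     | beam.Map(print))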

Regex / subString to extract all matching patterns / groups

I get this as a response to an API hit.
1735 Queries
Taking 1.001303 to 31.856310 seconds to complete
SET timestamp=XXX;
SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
38 Queries
Taking 1.007646 to 5.284330 seconds to complete
SET timestamp=XXX;
show slave status;
6 Queries
Taking 1.021271 to 1.959838 seconds to complete
SET timestamp=XXX;
SHOW SLAVE STATUS;
2 Queries
Taking 4.825584, 18.947725 seconds to complete
use marketing;
SET timestamp=XXX;
SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
I have extracted this out of the response HTML and have it as a string now. I need to retrieve the values as concisely as possible, such that I get a map of this format: Map(query -> "T1 to T2 seconds"). Basically, this is the status of all the slow queries running on the MySQL slave server. I am building an alert system over it, so from this entire paragraph, in the form of a String, I need to separate out the queries and save the corresponding time range with them.
For example, 1.001303 to 31.856310 is a time range, and against that time range the corresponding query is:
SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
I was hoping to save this information in a Map in Scala: a Map of the form (query: String -> timeRange: String).
Another example:
("use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified xyz ;"->"4.825584 to 18.947725 seconds")
"""###(.)###(.)\n\n(.*)###""".r.findAllIn(reqSlowQueryData).matchData foreach {m => println("group0"+m.group(1)+"next group"+m.group(2)+m.group(3)}
I am using the above statement to extract the the repeating cells to do my manipulations on it later. But it doesnt seem to be working;
THANKS IN ADvance! I know there are several ways to do this but all the ones striking me are inefficient and tedious. I need Scala to do the same! Maybe I can extract recursively using the subString method ?
If you want to use Scala, try this:
val regex = """(\d+).(\d+).*(\d+).(\d+) seconds""".r // extract range
val txt = """
|1735 Queries
|
|Taking 1.001303 to 31.856310 seconds to complete
|
|SET timestamp=XXX; SELECT * FROM ABC_EM WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
|
|38 Queries
|
|Taking 1.007646 to 5.284330 seconds to complete
|
|SET timestamp=XXX; show slave status;
|
|6 Queries
|
|Taking 1.021271 to 1.959838 seconds to complete
|
|SET timestamp=XXX; SHOW SLAVE STATUS;
|
|2 Queries
|
|Taking 4.825584, 18.947725 seconds to complete
|
|use marketing; SET timestamp=XXX; SELECT * FROM ABC WHERE last_modified >= 'XXX' AND last_modified < 'XXX';
""".stripMargin
def logToMap(txt:String) = {
val (_,map) = txt.lines.foldLeft[(Option[String],Map[String,String])]((None,Map.empty)){
(acc,el) =>
val (taking,map) = acc // taking contains range
taking match {
case Some(range) if el.trim.nonEmpty => //Some contains range
(None,map + ( el -> range)) // add to map
case None =>
regex.findFirstIn(el) match { //extract range
case Some(range) => (Some(range),map)
case _ => (None,map)
}
case _ => (taking,map) // probably empty line
}
}
map
}
Modified ajozwik's answer to work for SQL commands over multiple lines :
val regex = """(\d+).(\d+).*(\d+).(\d+) seconds""".r // extract range
def logToMap(txt:String) = {
val (_,map) = txt.lines.foldLeft[(Option[String],Map[String,String])]((None,Map.empty)){
(accumulator,element) =>
val (taking,map) = accumulator
taking match {
case Some(range) if element.trim.nonEmpty=> {
if (element.contains("Queries"))
(None, map)
else
(Some(range),map+(range->(map.getOrElse(range,"")+element)))
}
case None =>
regex.findFirstIn(element) match {
case Some(range) => (Some(range),map)
case _ => (None,map)
}
case _ => (taking,map)
}
}
println(map)
map
}