Managed Workflows with Apache Airflow (MWAA) - how to disable task run dependency on previous run - amazon-web-services

I have an Apache Airflow managed environment running in which a number of DAGs are defined and enabled. Some DAGs are scheduled, running on a 15 minute schedule, while others are not scheduled. All the DAGs are single-task DAGs. The DAGs are structured in the following way:
level 2 DAGs -> (triggers) level 1 DAG -> (triggers) level 0 DAG
The scheduled DAGs are the level 2 DAGs, while the level 1 and level 0 DAGs are unscheduled. The level 0 DAG uses ECSOperator to call a pre-defined Elastic Container Service (ECS) task, to call a Python ETL script inside a Docker container defined in the ECS task. The level 2 DAGs wait on the level 1 DAG to complete, which in turns waits on the level 0 DAG to complete. The full Python logs produced by the ETL scripts are visible in the CloudWatch logs from the ECS task runs, while the Airflow task logs only show high-level logging.
The singular tasks in the scheduled DAGs (level 2) have depends_on_past set to False, and I expected that as a result successive scheduled runs of a level 2 DAG would not depend on each other, i.e. that if a particular run failed it would not prevent the next scheduled run from occurring. But what is happening is that Airflow is overriding this and I can clearly see in the UI that a failure of a particular level 2 DAG run is preventing the next run from being selected by the scheduler - the next scheduled run state is being set to None, and I have to manually clear the failed DAG run state before the scheduler can schedule it again.
Why does this happen? As far as I know, there is no Airflow configuration option that should override the task-level setting of False for depends_on_past in the level 2 DAG tasks. Any pointers would be greatly appreciated.

Answering the question "why is this happening?". I understand that the behavior you are watching is explained by the fact that Tasks are being defined with wait_for_downstream = True. The docs state the following about it:
wait_for_downstream (bool) -- when set to true, an instance of task X will wait for tasks immediately downstream of the previous instance of task X to finish successfully or be skipped before it runs. This is useful if the different instances of a task X alter the same asset, and this asset is used by tasks downstream of task X. Note that depends_on_past is forced to True wherever wait_for_downstream is used. Also note that only tasks immediately downstream of the previous task instance are waited for; the statuses of any tasks further downstream are ignored.
Keep in mind that the term previous instances of task X refers to the task_instance of the last scheduled dag_run, not the upstream Task (in a DAG with a daily schedule, that would be the task_instance from "yesterday").
This also explains why your Task are being executed once you clear the state of the previous DAG Run.
I hope it helps you clarifying things up!

Related

MWAA Airflow Scaling: what do I do when I have to run frequent & time consuming scripts? (Negsignal.SIGKILL)

I have an MWAA Airflow env in my AWS account. The DAG I am setting up is supposed to read massive data from S3 bucket A, filter what I want and dump the filtered results to S3 bucket B. It needs to read every minute since the data is coming in every minute. Every run processes about 200MB of json data.
My initial setting was using env class mw1.small with 10 worker machines, if I only run the task once in this setting, it takes about 8 minutes to finish each run, but when I start the schedule to run every minute, most of them could not finish, starts to take much longer to run (around 18 mins) and displays the error message:
[2021-09-25 20:33:16,472] {{local_task_job.py:102}} INFO - Task exited with return code Negsignal.SIGKILL
I tried to expand env class to mw1.large with 15 workers, more jobs were able to complete before the error shows up, but still could not catch up with the speed of ingesting every minute. The Negsignal.SIGKILL error would still show before even reaching worker machine max.
At this point, what should I do to scale this? I can imagine opening another Airflow env but that does not really make sense. There must be a way to do it within one env.
I've found the solution to this, for MWAA, edit the environment and under Airflow configuration options, setup these configs
celery.sync_parallelism = 1
celery.worker_autoscale = 1,1
This will make sure your worker machine runs 1 job at a time, preventing multiple jobs to share the worker, hence saving memory and reduces runtime.

Cloud Tasks Conditional Execution

I am using Cloud Tasks. I need to trigger the execution of Task C only when Task A and Task B have been completed successfully. So I need some way of reading / being notified of the statuses of Tasks triggered. But I see no way of doing this in GCP's documentation. Using Node.js SDK to create tasks and Cloud Functions as task handlers if at all that helps.
Edit:
As requested, here is more info on what we are doing:
Tasks 1 - 10 each make HTTP requests, fetch data, update individual collections in Firestore based on this data. These 10 tasks can run in parallel and in no particular order as they don't have any dependency on each other. All of these tasks are actually implemented inside GCF.
Task 11 actually depends on the Firestore collection data updated by Tasks 1 - 10. So it can only run after Tasks 1 - 10 are completed successfully.
We do issue a RunID as a common identifier to group a particular run of all tasks (1 - 11).
Cloud Task only trigger task, you can only define time condition. You have to code manually the check when the task C run.
Here an example of process:
Task A is running, at the end, the task write in firestore that is completed
Task B is running, at the end, the task write in firestore that is completed
Task C start and check if A and B are completed in firestore.
If not, the task exit in error
Is yes, continue the process
You have to customize your C task queue for retrying the task in case of error.
Another, expensive, solution is to use Cloud Composer for handling this workflow
There is no other solution for now about workflow management.
Cloud Tasks is not the tool you want to use in this case. Take a look into Cloud Composer which is built in top of Apache Airflow for GCP.
Edit: You could create a GCF to handle the states of those requests
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed
################ TASK A
taskA_list = [
"https://via.placeholder.com/400",
"https://via.placeholder.com/410",
"https://via.placeholder.com/420",
"https://via.placeholder.com/430",
"https://via.placeholder.com/440",
"https://via.placeholder.com/450",
"https://via.placeholder.com/460",
"https://via.placeholder.com/470",
"https://via.placeholder.com/480",
"https://via.placeholder.com/490",
]
def call2TaskA(url):
html = requests.get(url, stream=True)
return (url,html.status_code)
processes = []
results = []
with ThreadPoolExecutor(max_workers=10) as executor:
for url in taskA_list:
processes.append(executor.submit(call2TaskA, url))
isOkayToDoTaskB = True
for taskA in as_completed(processes):
result = taskA.result()
if result[1] != 200: # your validation on taskA
isOkayToDoTaskB = False
results.append(result)
if not isOkayToDoTaskB:
raise ValueError('Problems: {}'.format(results))
################ TASK B
def doTaskB():
pass
doTaskB()

Cloud composer tasks fail without reason or logs

I run Airflow in a managed Cloud-composer environment (version 1.9.0), whic runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM. But sometime in the morning, I see a few Tasks failed without a reason during the night.
When checking the log using the UI - I see no log and I see no log either when I check the log folder in the GCS bucket
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled" but the dependency is the dagrun itself.
Although the DAG is set with 5 retries and an email message it does not look as if any retry took place and I haven't received an email about the failure.
I usually just clear the task instance and it run successfully on the first try.
Has anyone encountered a similar problem?
Empty logs often means the Airflow worker pod was evicted (i.e., it died before it could flush logs to GCS), which is usually due to an out of memory condition. If you go to your GKE cluster (the one under Composer's hood) you will probably see that there is indeed a evicted pod (GKE > Workloads > "airflow-worker").
You will probably see in "Tasks Instances" that said tasks have no Start Date nor Job Id or worker (Hostname) assigned, which, added to no logs, is a proof of the death of the pod.
Since this normally happens in highly parallelised DAGs, a way to avoid this is to reduce the worker concurrency or use a better machine.
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.

How to throttle current jobs in airflow?

I am a newbie to Airflow. But I am now working on how to throttle current jobs in Airflow. Is there someone that knows a little about concurrency or throttling in Airflow. Any suggestions could be helpful.
Thanks a lot.
If you want to throttle tasks in a dag, you need to define its "concurrency" parameter.
"concurrency" defines how many running task instances a DAG is allowed
to have, beyond which point things get queued.
If you want to throttle tasks globally, look into this lines of the config file
The amount of parallelism as a setting to the executor. This defines
the max number of task instances that should run simultaneously
on this airflow installation
parallelism = 32
And
The number of task instances allowed to run concurrently by the scheduler
dag_concurrency = 16
The first is global, the second is the concurrency default value for all dags

How is it that a mapreduce pipeline can run longer than 10 minutes?

MapReduce tasks are run within a parent pipeline, and of course we all know they can run for a very long time. But at the same time, the pipeline api documents that a pipeline must complete within 10 minutes (https://github.com/GoogleCloudPlatform/appengine-pipelines/wiki/Python). What is the proper way to understand this?
Thanks.
That pipeline documentation is really old... when it was written, tasks were limited to 10-mins. Now you can configure a non-default modules (used to be called a "backend") using basic/manual scaling that will allow a task to run for 24hrs
https://cloud.google.com/appengine/docs/python/modules/#Python_Instance_scaling_and_class
(NOTE: if you run a task on an auto-scaled module, it will still be limited to 10-mins)
The entire pipeline doesn't have to be limited to 24hrs though. The "root" pipeline (the first task that runs) can yield many child pipelines, and those each can further yield other pipelines... each pipeline is a task that has to run within the allotted time (10mins or 24hrs)... when it is done, it signals the parent to wake-up and finish... so the overall pipeline could run for days or months or whatever
We have our app split into two modules, one for the front-end (default, auto-scaled) that handles web requests, and one for the "back end" (basic scaling) that runs all of our tasks