Airflow DAG is running for all the retries

I have a DAG that has been running for a few months, and for the last week it has been behaving abnormally. I am running a BashOperator that executes a shell script, and in the shell script we have a Hive query.
The number of retries is set to 4, as below.
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 4,
    'retry_delay': timedelta(minutes=5)
}
I can see in the log that it triggers the Hive query and loses the heartbeat after some time (around 5 to 6 minutes), then goes for a retry.
YARN shows that the query has not finished yet, but Airflow has already triggered the next attempt. Now two queries are running in YARN for the same task (one for the first attempt and one for the retry). In the same way, this DAG ends up triggering 5 queries for the same task (since retries is 4) and finally shows a failed status.
The interesting point is that the same DAG had been running fine for a long time. Also, this issue affects all the Hive-related DAGs in production.
Today I upgraded to the latest version of Airflow, v1.10.9.
I am using the LocalExecutor in this case.
Has anyone faced a similar issue?

The Airflow UI doesn't initiate retries on its own, irrespective of whether it's connected to the backend DB or not. It seems like your task instances are going zombie; in that case the scheduler's zombie detection kicks in and calls the task instance's (TI's) handle_failure method. So, in a nutshell, you can override that method in your DAG and add some logging to see what's happening; in fact, you should be able to query the Hadoop ResourceManager, check the status of your job, and make a decision accordingly, including cancelling the retries.
For example, see this code, which I wrote to handle zombie failures only.
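As a rough illustration only (not the code referenced above), here is a minimal sketch of a retry/failure callback that asks the YARN ResourceManager about the job and kills it, so the next attempt doesn't stack a second query on top of the first. The RM URL, the XCom key and the callback wiring are assumptions; depending on your Airflow version the zombie path may pass a limited context, in which case overriding handle_failure as described above is the more direct route.

import requests

YARN_RM = 'http://resourcemanager:8088'  # placeholder ResourceManager address

def kill_stale_yarn_app(context):
    """Clean up the YARN application left behind by a zombie task attempt."""
    if not context:
        return
    ti = context['task_instance']
    # Assumes the shell script pushed its YARN application id to XCom;
    # otherwise look it up via the RM API by name or tag.
    app_id = ti.xcom_pull(task_ids=ti.task_id, key='yarn_application_id')
    if not app_id:
        return
    app = requests.get('{}/ws/v1/cluster/apps/{}'.format(YARN_RM, app_id)).json()['app']
    if app['state'] in ('ACCEPTED', 'RUNNING'):
        # Kill the still-running query before the next attempt launches a new one
        requests.put('{}/ws/v1/cluster/apps/{}/state'.format(YARN_RM, app_id),
                     json={'state': 'KILLED'})

default_args = {
    # ... existing args ...
    'on_retry_callback': kill_stale_yarn_app,
    'on_failure_callback': kill_stale_yarn_app,
}

With something like this in default_args, the attempt is still marked failed, but the orphaned YARN application is killed before the next try, so you no longer end up with 5 copies of the same query.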

Related

AWS Glue job run not respecting Timeout and not stopping

I am running AWS Glue jobs using PySpark. They have a Timeout set (as visible on the screenshot) of 1440 minutes, which is 24 hours. Nevertheless, the jobs keep running beyond those 24 hours.
When this particular job had been running for over 5 days, I stopped it manually (clicking the stop icon in the "Run status" column of the GUI visible on the screenshot). However, since then (it has been over 2 days) it still hasn't stopped: the "Run status" is Stopping, not Stopped.
Additionally, after about 4 hours of running, new logs (column "Logs") in CloudWatch for this job run stopped appearing (in my PySpark script I have print() statements which regularly and often log extra data). Also, the last error log in CloudWatch (column "Error logs") was written 24 seconds after the timestamp of the newest log in "Logs".
This behaviour repeats across multiple jobs.
My questions are:
What could be the reasons for job runs not obeying the set Timeout value? How can that be fixed?
Why is the newest log from about 4 hours after the job run started, when logs should appear regularly throughout the 24 hours of the (desired) duration of the job run?
Why don't the job runs stop when I try to stop them manually? How can they be stopped?
Thank you in advance for your advice and hints.

Managed Workflows with Apache Airflow (MWAA) - how to disable task run dependency on previous run

I have an Apache Airflow managed environment running in which a number of DAGs are defined and enabled. Some DAGs are scheduled, running on a 15 minute schedule, while others are not scheduled. All the DAGs are single-task DAGs. The DAGs are structured in the following way:
level 2 DAGs -> (triggers) level 1 DAG -> (triggers) level 0 DAG
The scheduled DAGs are the level 2 DAGs, while the level 1 and level 0 DAGs are unscheduled. The level 0 DAG uses the ECSOperator to run a pre-defined Elastic Container Service (ECS) task, which runs a Python ETL script inside a Docker container defined in the ECS task. The level 2 DAGs wait on the level 1 DAG to complete, which in turn waits on the level 0 DAG to complete. The full Python logs produced by the ETL scripts are visible in the CloudWatch logs from the ECS task runs, while the Airflow task logs only show high-level logging.
The singular tasks in the scheduled DAGs (level 2) have depends_on_past set to False, and I expected that as a result successive scheduled runs of a level 2 DAG would not depend on each other, i.e. that if a particular run failed it would not prevent the next scheduled run from occurring. But what is happening is that Airflow is overriding this and I can clearly see in the UI that a failure of a particular level 2 DAG run is preventing the next run from being selected by the scheduler - the next scheduled run state is being set to None, and I have to manually clear the failed DAG run state before the scheduler can schedule it again.
Why does this happen? As far as I know, there is no Airflow configuration option that should override the task-level setting of False for depends_on_past in the level 2 DAG tasks. Any pointers would be greatly appreciated.
Answering the question "why is this happening?": I understand that the behavior you are observing is explained by the tasks being defined with wait_for_downstream=True. The docs state the following about it:
wait_for_downstream (bool) -- when set to true, an instance of task X will wait for tasks immediately downstream of the previous instance of task X to finish successfully or be skipped before it runs. This is useful if the different instances of a task X alter the same asset, and this asset is used by tasks downstream of task X. Note that depends_on_past is forced to True wherever wait_for_downstream is used. Also note that only tasks immediately downstream of the previous task instance are waited for; the statuses of any tasks further downstream are ignored.
Keep in mind that the term previous instances of task X refers to the task_instance of the last scheduled dag_run, not the upstream Task (in a DAG with a daily schedule, that would be the task_instance from "yesterday").
This also explains why your tasks are executed once you clear the state of the previous DAG run.
I hope this helps clear things up!
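For illustration, here is a minimal sketch of a level 2 DAG whose single task leaves both flags off (assuming Airflow 2.x imports; the DAG id, schedule and callable are placeholders, not your actual code):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

with DAG(
    dag_id='level_2_example',
    start_date=datetime(2022, 1, 1),
    schedule_interval='*/15 * * * *',
    catchup=False,
) as dag:
    PythonOperator(
        task_id='trigger_level_1',
        python_callable=lambda: print('trigger the level 1 DAG here'),
        depends_on_past=False,
        wait_for_downstream=False,  # if True, depends_on_past is forced back to True
    )

If any level 2 task currently sets wait_for_downstream=True (explicitly or via default_args), removing it should restore the behaviour you expected from depends_on_past=False.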

EMR Job Long Running Notifications

Consider that we have around 30 EMR jobs that run between 5:30 AM and 10:30 AM PST.
We have S3 buckets and we receive flat files in an S3 bucket; through Lambda functions, the received files are copied to other target paths.
We have DynamoDB tables for data processing once data is received in the target path.
Now the problem area: since we have multiple dependencies and parallel execution, sometimes a job fails due to a memory issue, and sometimes it takes more time to complete.
Sometimes it will run for 4 or 5 hours and finally get terminated with a memory issue or some other issue, such as a subnet not being available or an EC2 problem. So we don't want to wait that long.
E.g.: Job_A processes the 1st to 4th files and Job_B processes the 5th to 10th files, and so on.
Here Job_B has a dependency on Job_A through the 3rd file, so Job_B will wait until Job_A completes. We have dependencies like this throughout our process.
I would like to get notifications from the EMR jobs like the one below:
E.g.: The average running time for Job_A is 1 hour, but if it is running for more than 1 hour, I need to be notified by email or some other way.
How can this be achieved? Please help or advise.
Regards,
Karthik
Repeatedly call the list of steps using a Lambda function and the AWS SDK, e.g. boto3, and check the start date. When it is more than 1 hour behind, you can trigger a notification, for example with Amazon SES. See the documentation.
For example, you can call list_steps for the running steps only.
response = client.list_steps(
    ClusterId='string',
    StepStates=['RUNNING']
)
It will give you a response like the one below.
{
    'Steps': [
        {
            ...
            'Status': {
                ...
                'Timeline': {
                    'CreationDateTime': datetime(2015, 1, 1),
                    'StartDateTime': datetime(2015, 1, 1),
                    'EndDateTime': datetime(2015, 1, 1)
                }
            }
        },
    ],
    ...
}
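For illustration, here is a minimal sketch of how the whole check could look inside a Lambda handler; the cluster id, email addresses and the one-hour threshold are placeholders to adapt:

from datetime import datetime, timezone
import boto3

emr = boto3.client('emr')
ses = boto3.client('ses')

THRESHOLD_SECONDS = 3600  # e.g. Job_A's average run time of 1 hour

def lambda_handler(event, context):
    # Only look at steps that are still running on the cluster
    response = emr.list_steps(ClusterId='j-XXXXXXXXXXXXX',
                              StepStates=['RUNNING'])
    now = datetime.now(timezone.utc)
    for step in response['Steps']:
        started = step['Status']['Timeline'].get('StartDateTime')
        if started and (now - started).total_seconds() > THRESHOLD_SECONDS:
            # Step has exceeded its expected runtime, so notify via SES
            ses.send_email(
                Source='alerts@example.com',
                Destination={'ToAddresses': ['team@example.com']},
                Message={
                    'Subject': {'Data': 'EMR step {} running too long'.format(step['Name'])},
                    'Body': {'Text': {'Data': 'Step {} started at {} and is still RUNNING.'.format(
                        step['Id'], started)}},
                },
            )

Schedule the Lambda with an EventBridge (CloudWatch Events) rule every few minutes during your 5:30-10:30 window, and it will email you whenever a step runs past its expected duration.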

Cloud composer tasks fail without reason or logs

I run Airflow in a managed Cloud Composer environment (version 1.9.0), which runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM. But some mornings, I see that a few tasks failed without a reason during the night.
When checking the logs in the UI, I see no log, and I see no log either when I check the log folder in the GCS bucket.
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled", but the dependency is the dagrun itself.
Although the DAG is set with 5 retries and email notifications, it does not look as if any retry took place, and I haven't received an email about the failure.
I usually just clear the task instance and it runs successfully on the first try.
Has anyone encountered a similar problem?
Empty logs often mean the Airflow worker pod was evicted (i.e., it died before it could flush logs to GCS), which is usually due to an out-of-memory condition. If you go to your GKE cluster (the one under Composer's hood), you will probably see that there is indeed an evicted pod (GKE > Workloads > "airflow-worker").
You will probably also see in "Task Instances" that the affected tasks have no start date, job id, or worker (hostname) assigned, which, together with the missing logs, is evidence that the pod died.
Since this normally happens with highly parallelised DAGs, a way to avoid it is to reduce the worker concurrency or use a machine type with more memory.
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.

Airflow backfill not completely filling up to the present

I'm struggling with a weird problem I can't seem to figure out. I have a basic DAG that does nothing fancy; it just uses the bash operator to start a Python script.
I have this DAG scheduled to run every Monday. When I switch on the DAG in the webserver, it starts backfilling up to the 25th of September. However, it does not backfill the 2nd of October. When I change the schedule from weekly to daily, it works fine.
This is my DAG setup:
default_args = {
    'owner': 'xxxxx',
    'depends_on_past': False,
    'email': ['xxxxxx'],
    'start_date': datetime(2017, 9, 1, 0, 0),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=10),
}

# Create DAG
dag = DAG(dag_id='IFS_weekly_forecast',
          schedule_interval='0 8 * * MON',
          default_args=default_args)
As you can see from this picture, the backfilling is fine up to the 25th. After that, no new tasks are queued.
What am I doing wrong here? I have other DAGs that have been running fine for weeks. I also restarted the scheduler and the webserver, but this did not help.
Edit:
The topic below seems to cover the same problem. However, this changes my question: how can I let Airflow run weekly on a given date, instead of waiting for that entire period to finish?
Airflow does not backfill latest run