We're facing very strange behavior in Airflow.
Some subdags get stuck in the running state (even though their internal tasks have ended successfully).
For example, I have the subdag load-folder-to-layer that should have started and ended on 2020-11-18, but it stayed stuck until the 20th.
I looked in the task_instance and job tables and could see that the task's job was executing and receiving heartbeats:
The last log message in the subdag is:
[2020-11-18 09:22:01,879] {taskinstance.py:664} INFO - Dependencies not met for <TaskInstance: bietlejuice.docx.load-folder-reference-to-clean 2020-11-17T03:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
Also, I noticed a node drop on that exact same day, the 18th (image):
This leads me to think that the scheduler hits a bug when reassigning this task (subdag) to another worker, leaving the task stuck.
Does anyone have a clue about this?
After running for 17 hours, my Dataflow job failed with the following message:
The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures.
The 4 failures consist of 3 workers losing contact with the service, and one worker reported dead:
****-q15f Root cause: The worker lost contact with the service.
****-pq33 Root cause: The worker lost contact with the service.
****-fzdp Root cause: The worker ****-fzdp has been reported dead. Aborting lease 4624388267005979538.
****-nd4r Root cause: The worker lost contact with the service.
I don't see any errors in the worker logs for the job in Stackdriver. Is this just bad luck? I don't know how frequently work items need to be retried, so I don't know the probability that a single work item fails 4 times over the course of a 24-hour job. But this same type of job failure happens frequently for this long-running job, so it seems like we need some way to either decrease the failure rate of work items or increase the allowed number of retries. Is either possible? This doesn't seem related to my pipeline code, but in case it's relevant, I'm using the Python SDK with apache-beam==2.15.0. I'd appreciate any advice on how to debug this.
Update: The "STACK TRACES" section in the console is totally empty.
I was having the same problem and it was solved by scaling up my workers' resources. Specifically, I set --machine_type=n1-highcpu-96 in my pipeline configs. See this for a more extensive list of machine type options.
Edit: set it to highcpu or highmem depending on the requirements of your pipeline process.
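For reference, a minimal sketch of how this can be passed with the Python SDK; the project, region and bucket values below are placeholders, not from the original job:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project/region/bucket values; machine_type corresponds to
# the --machine_type flag and controls the Dataflow worker VM size.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="europe-west1",
    temp_location="gs://my-bucket/tmp",
    machine_type="n1-highcpu-96",
)

with beam.Pipeline(options=options) as pipeline:
    (pipeline
     | "Create" >> beam.Create([1, 2, 3])
     | "Square" >> beam.Map(lambda x: x * x))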
I run Airflow in a managed Cloud Composer environment (version 1.9.0), which runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM, but some mornings I see that a few tasks failed during the night without any apparent reason.
When I check the logs in the UI there are none, and there are none in the log folder in the GCS bucket either.
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled" but the dependency is the dagrun itself.
Although the DAG is configured with 5 retries and an email on failure, it does not look as if any retry took place, and I haven't received an email about the failure.
I usually just clear the task instance and it runs successfully on the first try.
Has anyone encountered a similar problem?
Empty logs often mean the Airflow worker pod was evicted (i.e., it died before it could flush logs to GCS), which is usually due to an out-of-memory condition. If you go to your GKE cluster (the one under Composer's hood) you will probably see that there is indeed an evicted pod (GKE > Workloads > "airflow-worker").
You will probably also see in "Task Instances" that the affected tasks have no start date, job id or worker (hostname) assigned, which, together with the missing logs, is evidence that the pod died.
Since this normally happens in highly parallelised DAGs, a way to avoid it is to reduce the worker concurrency (see the sketch below) or use a machine with more memory.
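A sketch of lowering worker concurrency through a Composer Airflow config override; the environment name, location and value are placeholders:

gcloud composer environments update my-environment \
    --location us-central1 \
    --update-airflow-configs celery-worker_concurrency=6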
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.
I am running Dataflow jobs on Google Cloud Platform, and one new error I get is "Workflow failed" without any explanation.
The logs I get are the following:
2017-08-25 (00:06:01) Executing operation ReadNewXXXFromStorage/Read+JsonStringsToXXX+RemoveLanguagesFromXXX...
2017-08-25 (00:06:01) Executing operation ReadOldXYZ_ABC_1234_123_ns_123123123123123/GroupByKey/Create
2017-08-25 (00:06:01) Starting 1 workers in europe-west1-b...
2017-08-25 (00:06:01) Executing operation ReadOldXYZ_ABC_1234_123_ns_123123123123123/ParDo(SplitQuery)+ReadOldXYZ...
2017-08-25 (00:06:48) Workflow failed.
2017-08-25 (00:06:48) Stopping worker pool...
2017-08-25 (00:06:58) Worker pool stopped.
How am I supposed to find out what's going wrong? It should not be a problem with permissions on the object, as similar jobs run successfully.
When I try to rerun the template from Google Cloud Console, I get the message:
No metadata file found for this template
But I am able to start the template, and now it runs successfully. Could this have to do with exceeded quotas? We just increased our CPU and IP quota for Dataflow, and I increased our parallel running jobs from 5 to 15 to be able to use the quota. When I rerun the template without any other jobs running, everything seems to work fine.
Any input is highly appreciated. Thanks.
EDIT: It seems like the jobs failed because of an exceeded CPU quota, but usually we would get an error description saying "could not spawn enough workers". Nevertheless, everything works fine after I reduced the maximum number of workers per job so that our quota cannot be exceeded.
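For reference, with the Python SDK the cap can be set in the pipeline options when the job or template is built (a sketch; the SDK in use isn't stated in the question and the value is a placeholder):

from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder cap; maps to --max_num_workers and limits Dataflow autoscaling
# so the job stays within the available CPU quota.
options = PipelineOptions(max_num_workers=5)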
I believe the "No metadata file found for this template" should be considered a warning, not an error. A template is able to have a "metadata" file associated with it which allows validation of parameters. If no such file is present, the parameters aren't validated, but everything else works as normal -- the message is just the indicator of this situation.
It sounds like the problem was the job being unable to run for other reasons. Based on your description and the edit, it sounds like this was due to a lack of quota to run the job.
I am curious to understand how a loss of connectivity with AWS SWF affects currently executing workflows. Could someone please help me understand?
I understand there would be timeouts for deciders and workers, but I'm not sure of the exact behavior.
An activity worker that is waiting on a poll will get an error and is expected to keep retrying until connectivity is back. An activity worker that has completed a task is expected to keep retrying to report the completion until the task expires.
A workflow worker that is waiting on a poll will get an error and is expected to keep retrying until connectivity is back. A workflow worker that has completed a decision task can retry to complete it until it expires. After it expires, the decision task is automatically rescheduled and is available for polling as soon as connectivity is back (see the polling sketch below).
A scheduled activity that isn't picked up within its schedule-to-start timeout is automatically failed. Its failure is recorded in the workflow history and a new decision task is scheduled.
A picked-up activity that isn't completed within its start-to-close timeout is automatically failed. Its failure is recorded in the workflow history and a new decision task is scheduled.
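To make the "keep retrying until connectivity is back" part concrete, here is a rough activity-worker polling sketch using boto3 rather than the Flow Framework; the region, domain, task list and do_work function are placeholders:

import time
import boto3
from botocore.exceptions import EndpointConnectionError

swf = boto3.client("swf", region_name="us-east-1")  # placeholder region

def do_work(task):
    # Placeholder for the real activity logic.
    return "done"

while True:
    try:
        # Long-poll for an activity task; an empty taskToken means the poll
        # simply timed out with no work available.
        task = swf.poll_for_activity_task(
            domain="my-domain",                 # placeholder
            taskList={"name": "my-task-list"},  # placeholder
        )
        if not task.get("taskToken"):
            continue
        swf.respond_activity_task_completed(
            taskToken=task["taskToken"], result=do_work(task))
    except EndpointConnectionError:
        # Connectivity lost: back off and poll again; SWF's schedule-to-start
        # and start-to-close timeouts bound how long a task can sit or run.
        time.sleep(5)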
I wrote a workflow using the AWS Flow Framework for Java. It is working fine, but I am facing an issue when trying to re-run this workflow after some time.
After registering this workflow's workers, re-runs work fine for a while, no matter how many times I try, but then it suddenly stops working, gets stuck at the DecisionTaskScheduled event, and eventually times out. I checked the history and it shows "No Activities found for the given execution". However, if I manually re-register the activities, it starts working again. Please help me fix this issue.
If a workflow execution is stuck at the DecisionTaskScheduled event, then the workflow worker is not running or is having some issue. I'm not sure what you mean by "registering workflow workers": a workflow type is registered only once and there is no need to re-register it. Workers just run, polling SWF and processing decision tasks. Make sure the workflow worker is running and is not stuck for any reason.