I have been using Airflow on AWS(MWAA) for a couple of months now and I noticed that occasionally some Airflow tasks fail with no discernable reason and with no log messages in Cloudwatch. I often have to clear and retry the tasks multiple times for the task to eventually succeed.
Does anyone know the reason why this would be happening? I'm thinking with the autoscaling feature of MWAA this should not be an issue of resources. Anyone with any experience dealing with something like this or with any idea would be greatly appreciated. Thanks
Related
I published my web job (trigerred type) in azure and it keep on running after my team add Console.Readline(). it seems the culprit here.
The Job Finished after done it's task and not waiting for the cpu.
Please check the below following steps that helps to fix the issue:
Console.ReadLine() should not be present in the Web Job Code.
Recheck the code if there is any infinite loop running which causes this kind of issue.
There are many differences between the Triggered and Continuous Web Job that gives an idea on stopping the Web Jobs. For that information, visit this SO Answer given by #Jay Gong.
I published my web job (trigerred type) in azure
Actually, Web Jobs main aim is of 2 ways either in triggered or Continuous running.
As you didn't give any code snippet, the ideal way is to use the host.Run() for the Triggered Web Jobs and host.Start() for the Continuous Web Jobs Type in the program.cscode file.
We have a service running that orchestrates starting Fargate ECS tasks on messages from a RabbitMQ-queue. Sometimes tasks weirdly fail to start.
Info:
It starts a task somewhere between every other minute and every ten minutes.
It uses a set amount of task definitions. It re-uses the task definitions.
It consistently uses the same subnet in the same VPC.
The problem:
The vast majority of tasks starts fine. Say 98%. Sometimes tasks fail to start, and I get error messages. The error messages are not always the same, but they seem to be network-related.
Error messages I have gotten the last 36 hours:
'Timeout waiting for network interface provisioning to complete.'
'ResourceInitializationError: failed to configure ENI: failed to setup regular eni: netplugin failed with no error message'
'CannotPullContainerError: ref pull has been retried 5 time(s): failed to resolve reference <image that exists in repository>: failed to do request: Head https:<account-id>.dkr.ecr.eu-west-1.amazonaws.com/v2/k1-d...'
'ResourceInitializationError: failed to configure ENI: failed to setup regular eni: context deadline exceeded'
Thoughts:
It looks to me like there is a network-connectivity error of some sort.
The result of my Googling tells me that at least some of the errors can arise from having wrongly configured VPC or route-tables.
This is not the case here, I assume, since starting the exact same task with the exact same task definition in the same subnet works fine most of the time.
The ENI problem could maybe arise from me running out of ENI:s (?) on an EC2-instance, but since these tasks are started through Fargate I feel like that should not be the problem.
It seems like at least the network provisioning error can sometimes be an AWS issue.
Questions:
Why is this happening? Is it me or AWS?
Depending on the answer to the first question, is there something I can do to avoid this?
If there is nothing I can do, is there something I can do to mitigate it while it's happening? Should I simply just retry starting the task and hope that solves it?
Thanks very much in advance, I have been chasing this problem for months and feel like I am at least closing in on it, but this is as far as I can get on my own, I fear.
It is possible that tasks may fail to start due to a certain amount of reasons. Some of them are transient and are more "AWS" some others are more structural of your configuration and are more "you". For example the network time out is often due to a network misconfiguration where the task ENI does not have a proper route to the registry (e.g. Docker Hub). In all other cases it is possible that it's a transient one-off issue of the Fargate internals.
These problems may be transparent to you OR you may need to take action depending on how you use Fargate. For example, if you use Fargate tasks as part of an ECS service or an EKS deployment, the ECS/EKS routines will make sure they retry to instantiate the task to meet the service/deployment target configuration.
If you are launching the Fargate task using a one-off RunTask API call (i.e. not part of an orchestrator control loop that can monitor its failure) then it depends how you are calling that API. If you are calling it from tools such as AWS Step Functions, AWS Batch and possibly others, they all have retry mechanisms so if a task fails to launch they are smart enough to re-launch it.
However, if you are launching the task from an imperative line of code (or CLI command etc) then it's on your code to make sure the task has been launched properly and that you don't need to re-launch it upon an error message.
I'm using managed CloudRun to deploy a container with concurrency=1. Once deployed, I'm firing four long-running requests in parallel.
Most of the time, all works fine -- But occasionally, I'm facing 500's from one of the nodes within a few seconds; logs only provide the error message provided in the subject.
Using retry with exponential back-off did not improve the situation; the retries also end up with 500s. StackDriver logs also do not provide further information.
Potentially relevant gcloud beta run deploy arguments:
--memory 2Gi --concurrency 1 --timeout 8m --platform managed
What does the error message mean exactly -- and how can I solve the issue?
This error message can appear when the infrastructure didn't scale fast enough to catch up with the traffic spike. Infrastructure only keeps a request in the queue for a certain amount of time (about 10s) then aborts it.
This usually happens when:
traffic suddenly largely increase
cold start time is long
request time is long
We also faced this issue when traffic suddenly increased during business hours. The issue is usually caused by a sudden increase in traffic and a longer instance start time to accommodate incoming requests. One way to handle this is by keeping warm-up instances always running i.e. configuring --min-instances parameters in the cloud run deploy command. Another and recommended way is to reduce the service cold start time (which is difficult to achieve in some languages like Java and Python)
I also experiment the problem. Easy to reproduce. I have a fibonacci container that process in 6s fibo(45). I use Hey to perform 200 requests. And I set my Cloud Run concurrency to 1.
Over 200 requests I have 8 similar errors. In my case: sudden traffic spike and long processing time. (Short cold start for me, it's in Go)
I was able to resolve this on my service by raising the max autoscaling container count from 2 to 10. There really should be no reason that 2 would be even close to too low for the traffic, but I suspect something about the Cloud Run internals were tying up to 2 containers somehow.
Setting the Max Retry Attempts to anything but zero will remedy this, as it did for me.
I run Airflow in a managed Cloud-composer environment (version 1.9.0), whic runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM. But sometime in the morning, I see a few Tasks failed without a reason during the night.
When checking the log using the UI - I see no log and I see no log either when I check the log folder in the GCS bucket
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled" but the dependency is the dagrun itself.
Although the DAG is set with 5 retries and an email message it does not look as if any retry took place and I haven't received an email about the failure.
I usually just clear the task instance and it run successfully on the first try.
Has anyone encountered a similar problem?
Empty logs often means the Airflow worker pod was evicted (i.e., it died before it could flush logs to GCS), which is usually due to an out of memory condition. If you go to your GKE cluster (the one under Composer's hood) you will probably see that there is indeed a evicted pod (GKE > Workloads > "airflow-worker").
You will probably see in "Tasks Instances" that said tasks have no Start Date nor Job Id or worker (Hostname) assigned, which, added to no logs, is a proof of the death of the pod.
Since this normally happens in highly parallelised DAGs, a way to avoid this is to reduce the worker concurrency or use a better machine.
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.
We've got Celery/SQS set up for asynchronous task management. We're running Django for our framework. We have a celery task that has a self.retry() in it. Max_retries is set to 15. The retry is happening with an exponential backoff and takes 182 hours to complete all 15 retries.
Last week, this task went haywire, I think due to a bug in our code not properly handling a service outage. It resulted in exponential creation (retrying?) of the same celery task. It eventually used up all available memory and the worker crashed. Restarting the worker results in another crash a couple hours later, since all those tasks (and their retries) keep retrying and spawning new retries until we run out of memory again. Ultimately we ended up with nearly 600k tasks created!
We need our workers to ignore all the tasks with a specific celery GUID. Ideally we could just get rid of them for good. I was going to use revoke() but, per documentation (http://docs.celeryproject.org/en/3.1/userguide/workers.html#commands), this is only implemented for Redis and RabbitMQ, not SQS. Furthermore, when I go to the SQS service in the AWS console, it's showing zero messages in flight so it's not like I can just flush it.
Is there a way to delete or revoke a specific message from SQS using the Celery task ID? Or is there another way to fix this problem? Obviously we need to fix our code so we don't get into this situation again, but first we need to get our worker up and running because without it our website has reduced functionality. Thanks!