Google Dataflow "Workflow failed" with no reason - google-cloud-platform

I am running Dataflow-Jobs on Google Cloud Platform and one new Error I get is "Workflow failed" without any explanations.
The logs I get are the following:
2017-08-25 (00:06:01) Executing operation ReadNewXXXFromStorage/Read+JsonStringsToXXX+RemoveLanguagesFromXXX...
2017-08-25 (00:06:01) Executing operation ReadOldXYZ_ABC_1234_123_ns_123123123123123/GroupByKey/Create
2017-08-25 (00:06:01) Starting 1 workers in europe-west1-b...
2017-08-25 (00:06:01) Executing operation ReadOldXYZ_ABC_1234_123_ns_123123123123123/ParDo(SplitQuery)+ReadOldXYZ...
2017-08-25 (00:06:48) Workflow failed.
2017-08-25 (00:06:48) Stopping worker pool...
2017-08-25 (00:06:58) Worker pool stopped.
How am I supposed to find out whats going wrong? It should not be a problem with rights on the object, as similar jobs run successfully.
When I try to rerun the template from Google Cloud Console, I get the message:
No metadata file found for this template
But I am able to start the template and now it runs successfully. May this have to do with exceeded quotas? We just increased our CPU and IP-Quota for Dataflow and I increased our parallel running jobs from 5 to 15 to be able to use the quota. When I rerun the template without any other Jobs running, everything seems to work fine.
Any Input is highly appreciated. Thanks
EDIT: Seems like the Jobs failed because of exceeded CPU-Quota, but usually we would get an error-description where it says "could not spawn enough workers". Nevertheless, Everything works fine after I reduced the maximum number of workers per job, so that our quota cannot be exceeded.

I believe the "No metadata file found for this template" should be considered a warning, not an error. A template is able to have a "metadata" file associated with it which allows validation of parameters. If no such file is present, the parameters aren't validated, but everything else works as normal -- the message is just the indicator of this situation.
It sounds like the problem was the job being unable for other reasons. Based on your description and the edit, it sounds like this was because of lack of quota to run the job.

Related

No CloudWatch logs for ECS task with reason "Essential container in task exited"

A task is running for a few seconds before terminating, I don't know why, and it's not pushing any logs.
I'm using the "awslogs" driver and the log group exists in CloudWatch.
The "Logs" tab is empty. The log-stream is created in CW but it's devoid of actual log events. There are also no results under Insights for that stream.
The task role has the permissions mentioned at https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_cloudwatch_logs.html .
Any idea what the deal is with the logs?
The command wasn't valid nor was it comma-separated. It was terminating too early in the workflow to log anything, but yet after any other deployment issue would be identified. So, it was looking like it was successful but in reality wasn't yet even running. Interestingly, it would still take around a minute to terminate, so maybe this includes the overhead of pulling the image.
Timestamps indicate that task started and exited after some seconds. awslogs will send logs if container has been successfully started, so, in this case it may not be helping. You can follow step 6 of documentation to diagnose. Specifically, if you have a container that has stopped, expand the container and inspect the Status reason row to see what caused the task state to change. In most cases, that will lead you to actual cause

Long-running Dataflow job fails with no errors in user code

After running for 17 hours, my Dataflow job failed with the following message:
The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures.
The 4 failures consist of 3 workers losing contact with the service, and one worker reported dead:
****-q15f Root cause: The worker lost contact with the service.
****-pq33 Root cause: The worker lost contact with the service.
****-fzdp Root cause: The worker ****-fzdp has been reported dead. Aborting lease 4624388267005979538.
****-nd4r Root cause: The worker lost contact with the service.
I don't see any errors in the worker logs for the job in Stackdriver. Is this just bad luck? I don't know how frequently work items need to be retried, so I don't know what the probability is that a single work item will fail 4 times over the course of a 24 hour job. But this same type of job failure happens frequently for this long-running job, so it seems like we need some way to either decrease the failure rate of work items, or increase the allowed number of retries. Is either possible? This doesn't seem related to my pipeline code, but in case it's relevant, I'm using the Python SDK with apache-beam==2.15.0. I'd appreciate any advice on how to debug this.
Update: The "STACK TRACES" section in the console is totally empty.
I was having the same problem and it was solved by scaling up my workers resources. Specifically, I set --machine_type=n1-highcpu-96 in my pipeline configs. See this for a more extensive list on machine type options.
Edit: Set it to highcpu or highmem depending on the requirements of your pipeline process

Cloud composer tasks fail without reason or logs

I run Airflow in a managed Cloud-composer environment (version 1.9.0), whic runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM. But sometime in the morning, I see a few Tasks failed without a reason during the night.
When checking the log using the UI - I see no log and I see no log either when I check the log folder in the GCS bucket
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled" but the dependency is the dagrun itself.
Although the DAG is set with 5 retries and an email message it does not look as if any retry took place and I haven't received an email about the failure.
I usually just clear the task instance and it run successfully on the first try.
Has anyone encountered a similar problem?
Empty logs often means the Airflow worker pod was evicted (i.e., it died before it could flush logs to GCS), which is usually due to an out of memory condition. If you go to your GKE cluster (the one under Composer's hood) you will probably see that there is indeed a evicted pod (GKE > Workloads > "airflow-worker").
You will probably see in "Tasks Instances" that said tasks have no Start Date nor Job Id or worker (Hostname) assigned, which, added to no logs, is a proof of the death of the pod.
Since this normally happens in highly parallelised DAGs, a way to avoid this is to reduce the worker concurrency or use a better machine.
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.

How to debug failed fargate task initialization

I have a fargate task which I have scheduled to run with CloudWatch Event rules, and output a timestamp to a database on a successful run. It also outputs a logfile to CloudWatch for every time it runs.
However, there was 1 time where the log file was not created, and the database not updated. I suspect the task was never even started, or had failed to start.
In CloudWatch, the event rule shows trigger and invocation at the time I expected the task to run, so I assume the task at least attempted to start.
My question is: is there any way I can debug or log information about the cluster failing to start a task?
Please let me know if I need to provide more information.
Edit: I should specify I'm looking for a way to read this information in a log file somewhere. I know I can see failed task reason in the web console, but that's only for relatively recent tasks.
I have posted the same question here: https://www.reddit.com/r/aws/comments/adtqvt/debugging_failed_fargate_task_initialization/ and StackOverflow: https://forums.aws.amazon.com/thread.jspa?messageID=884638&#884638
Go to the cluster and choose the Tasks tab
In the lower pane, choose Stopped for the Desired Task Status value
Locate the desired Task and click it's GUID
Scroll down to the Containers section and expand the relevant containers that are experiencing errors
You'll see some kind of Status reason for the error. In my case it was:
CannotStartContainerError: API error (500): failed to initialize logging driver: Cannot determine region for awslogs driver
Edit: I can't really take credit for figuring this out - found it here:
https://github.com/aws/amazon-ecs-agent/issues/1654#issuecomment-437178282
Try going to "CloudWatch -> Logs -> Insights" and click on "Run Query":
I just faced this problem and the lack of logs did make it quite difficult to resolve.
The problem in my case was the security group used for the task had been deleted. Hope this helps if any one has a similar issue.

The operator or administrator has refused the request task scheduler

I have scheduled a C# console application in Task Scheduler of Windows 2012 R2. Application will run when executed it manually or Right click on scheduled task and click on Run, but it is failed when triggered by Task Scheduler with below error.
The operator or administrator has refused the request(0x800710E0)
I have followed below steps also after Google search
Selected "Run whether user logged in or not"
Unchecked "Start the task only if the computer is on AC power"
In my case, the error message "The operator or administrator has refused the request" meant that a previous instance of the task has still been running and the task was configured to not start a new instance if it's already running (the default configuration), so the Task Scheduler refused to start a new instance when the task was triggered.
You can find that option in a select box on the task's Settings tab, under the caption "If the task is already running, then the following rule applies". The default value is "Do not start a new instance".
But that error message is pretty confusing. From the other answers, you may see that it may mean many completely distinct errors. As is usual in Microsoft's products.
Tip
It's helpful to check the History tab of a task. That's where I have found out what's actually going on. There was an event "Launch request ignored, instance already running".
In my case, I had to redo the permissions on the task. Somehow it had lost the domain portion of the username. Instead of `DOMAIN\joeuser' it was just 'joeuser'. After a reset, it worked correctly as it had for the previous year.
In my case as per having a job setup with Task Scheduler as written about in the "Prevent a Task Scheduler Task from Executing on Setting Updates", I had a job setup to run every "X" minutes for a period of indefinitely.
Upon seeing the dreaded "The operator or administrator has refused the request" for the Last Run Result, I looked over the History tab and see detail indicating that is "missed its schedule".
The Solution
From the Settings tab of the job properties, I simply checked the option "Run task as soon as possible after a scheduled start is missed", and problem resolved; although, I did have to type in the credential again as well.
Note: This started occurring once a server was moved from a redundant backup server once hardware repair was completed back to the original hardware. The OS was Server 2012 R2 and the OS was moved to other hardware while repair was done on the production server but I didn't notice this there—maybe an oversight there though—not sure.
I know that #Sushmit-Patil found a solution, but I wanted to add a solution to my similar problem:
It turns out a prior process never exited (it was hanging around in memory because of a defect I had in my code). By default, Windows Task Scheduler won't run the process again if it's already running.
In addition to fixing the defect, in Task Scheduler, under the Settings tab, I set If the task is already running, then the following rule applies: to Run a new instance in parallel
1
Error occurred due to folder permission, I was creating CSV from my application, which was required folder permission to be granted. After giving Full Control to the folder error got resolved.
For me, the solution was to check Run with highest privileges in the properties.
In my case my task launches a PowerShell script--and it produced the "The operator or administrator has refused the request (0x800710E0)" error message as seen in the Task Scheduler's task-entry grid. My user name was correct, but when I dropped to a command prompt and simulated the task by running the PowerShell against my .ps1 file, I saw an Avast prompt that flagged my script as suspicious and wasn't allowing it to run. I created an Avast exception and now the task runs without any issue.
After turning on history I also had the error "Missed task start rejected: Task Scheduler did not launch task as it missed its schedule." but I didn't want the task to start when I woke up the computer, I wanted to figure out why the computer didn't wake up.
This answer helped me out -- by default Windows was waking for "Important Wake Timers Only" (system updates, but not my scheduled task).
In the setting Power Options > Edit Plan Settings > Change advanced power settings > Sleep > Allow wake timer change the option to "Enabled" and then your computer will wake up to run the task.
You can also do this from "settings". Probably earlier instance was already running and launching a new instance failed.
In my case, the error message "The operator or administrator has refused the request" appeared because the computer was in stand-by at the scheduled time (and the options "Wake the computer to run this task" and "Run task as soon as possible after a scheduled start was missed" were unchecked).
I had previously chosen "Enable All Tasks History" and a more useful error message appeared in the History tab: "Missed task start rejected: Task Scheduler did not launch task as it missed its schedule. Consider using the configuration option to start the task when available, if schedule is missed."
I have found what I believe to be a bizarre bug in Windows Server 2016 scheduler and maybe other Windows Server versions that produces the OP's error (and a workaround):
Here are the conditions:
You're using the "Monthly" option trigger in your task (I currently have all months selected and just a couple days chosen, e.g. 1st and 15th)
You have the "Synchronize across time zones" selected.
This was originally an issue I found back in November 2020 when my tasks were running twice all of a sudden after the DST time change (and this was a widely reported bug, but not an obvious solution). I never would have known, except that users started receiving duplicate emails from one of my tasks. In the history you would simply see the task running twice at what appeared to be exactly the same time. It worked fine before the time change. I forget all the troubleshooting I did then, but my end theory was that it was somehow confusing the time after the time change. The work around was to set the option "Synchronize across time zones" and all seemed well...
Fast forward to March when the DST time just changed back again and now I get every time the tasks with the Monthly option runs:
The operator or administrator has refused the request
The History tab on the task is also blank. If you change options and save, the History tab starts logging again and then sometimes stops if the task errors again. Weird.
One work around is to simply turn off the "Synchronize across time zones" option (tested). However, I don't recommend that option as I assume you'll have the duplicate running task issue again when the DST time changes again in November.
The one time I got an error to show in the History tab it stated:
Task Scheduler did not launch task "\EmailCampaign" as it missed its
schedule. Consider using the configuration option to start the task
when available, if schedule is missed.
Therefore, I went and set that option to start the task if the schedule is missed and all seems well. I figured I'd see the original error and then subsequently the task running, but no error any more either. It all just works.
I know this solution was reported above, but that's because most people's computers were asleep or something to that effect. My issue is on a production internet facing server that doesn't go to sleep, hibernate or anything related and only happens with specific conditions related to the Monthly trigger option. All my others tens of scheduled tasks work flawless.
I wrote a Powershell script to do a task. I was getting this error and landed here (as well as other lower ranked search results). The task would run manually and the first time it was triggered, but not on repeat even though I had it set up to end the task if it took longer than a minute.
My problem was caused by not providing an exit code in my powershell script. Task scheduler simply did not know the task had finished and would consider it still running. I could have simply allowed the next instance of the task to be started if the previous was not finished, but using the exit code is the 'right way'.
So I simply added a new line on the end of my PS1 --
exit
This topic is old but I had the same problem on windows server 2016.
My task executes a BAT script that zip a folder and upload on an external backup.
The task never ended because there was a "pause" at the end of my script. And my task was configured with "Dot not start a new instance" settings.
I solved my problem by removing the "pause". I don't know if it will be useful..