GCP Cloud functions health check failing - google-cloud-platform

I have been trying to deploy a function in GCP for two days, and I receive the following error each time:
OperationError: code=13, message=Function deployment failed due to a health check failure. This usually indicates that your code was built successfully but failed during test execution. Examine the logs to determine the cause. Try deploying again in a few minutes if it appears to be transient.
The Logs Viewer doesn't give a proper explanation of the problem; it just emits the following entry continuously until the deployment fails:
"Error: function terminated. Recommended action: inspect logs for termination reason. The function cannot be initialized."
Now, the interesting fact is that the same code was working a couple of weeks ago.
This really makes me wonder whether it is a bug on GCP's side introduced by the recent Cloud Functions upgrade.

I've been having the same issue. I found out it may be caused by a dependency problem; in my specific case it involved the slackclient library, but the root cause seems to be the yarl library it pulls in, and the recommended fix is to pin it as yarl==1.4.2.
Reference:
https://github.com/slackapi/python-slackclient/issues/764
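If your function depends on slackclient (or anything else that transitively pulls in yarl), pinning the version in requirements.txt was enough in my case. A minimal sketch; the slackclient version shown is only illustrative, keep whichever one you already depend on:

    # requirements.txt
    slackclient==2.5.0  # illustrative version; keep the one you already use
    yarl==1.4.2         # pin to work around the health-check failure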

Related

Occasional failure on Amazon ECS with different error messages when starting task

We have a service running that orchestrates starting Fargate ECS tasks based on messages from a RabbitMQ queue. Sometimes tasks inexplicably fail to start.
Info:
It starts a task somewhere between every other minute and every ten minutes.
It uses a fixed set of task definitions, which it re-uses.
It consistently uses the same subnet in the same VPC.
The problem:
The vast majority of tasks, say 98%, start fine. Sometimes tasks fail to start, and I get error messages. The error messages are not always the same, but they seem to be network-related.
Error messages I have gotten in the last 36 hours:
'Timeout waiting for network interface provisioning to complete.'
'ResourceInitializationError: failed to configure ENI: failed to setup regular eni: netplugin failed with no error message'
'CannotPullContainerError: ref pull has been retried 5 time(s): failed to resolve reference <image that exists in repository>: failed to do request: Head https:<account-id>.dkr.ecr.eu-west-1.amazonaws.com/v2/k1-d...'
'ResourceInitializationError: failed to configure ENI: failed to setup regular eni: context deadline exceeded'
Thoughts:
It looks to me like there is a network-connectivity error of some sort.
The result of my Googling tells me that at least some of the errors can arise from having wrongly configured VPC or route-tables.
This is not the case here, I assume, since starting the exact same task with the exact same task definition in the same subnet works fine most of the time.
The ENI problem could maybe arise from me running out of ENIs on an EC2 instance, but since these tasks are started through Fargate, I feel like that should not be the problem.
It seems like at least the network provisioning error can sometimes be an AWS issue.
Questions:
Why is this happening? Is it me or AWS?
Depending on the answer to the first question, is there something I can do to avoid this?
If there is nothing I can do, is there something I can do to mitigate it while it's happening? Should I simply retry starting the task and hope that solves it?
Thanks very much in advance. I have been chasing this problem for months and feel like I am at least closing in on it, but this is as far as I can get on my own, I fear.
Tasks may fail to start for a number of reasons. Some are transient and more on the "AWS" side; others are structural to your configuration and more on the "you" side. For example, the network timeout is often due to a network misconfiguration where the task ENI does not have a proper route to the registry (e.g. Docker Hub). In all other cases it may be a transient, one-off issue in the Fargate internals.
These problems may be transparent to you, or you may need to take action, depending on how you use Fargate. For example, if you use Fargate tasks as part of an ECS service or an EKS deployment, the ECS/EKS control loops will retry instantiating the task to meet the service/deployment target configuration.
If you are launching the Fargate task with a one-off RunTask API call (i.e. not as part of an orchestrator control loop that can monitor its failure), then it depends on how you are calling that API. If you are calling it from tools such as AWS Step Functions, AWS Batch, and possibly others, they all have retry mechanisms, so if a task fails to launch they are smart enough to re-launch it.
However, if you are launching the task from an imperative line of code (or a CLI command, etc.), then it's on your code to verify that the task launched properly and to re-launch it if you get an error back.
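To illustrate that last point, here is a minimal boto3 sketch (the helper name and retry policy are my own, not an official pattern). It checks the failures list that RunTask can return even on a successful API call, and then waits for the task to actually reach RUNNING, since errors like the ENI provisioning timeout only surface after launch:

    import time
    import boto3
    from botocore.exceptions import WaiterError

    ecs = boto3.client("ecs")

    def run_task_with_retry(cluster, task_def, network_config, max_attempts=3):
        for attempt in range(1, max_attempts + 1):
            response = ecs.run_task(
                cluster=cluster,
                taskDefinition=task_def,
                launchType="FARGATE",
                networkConfiguration=network_config,
            )
            # RunTask can return HTTP 200 and still report per-task
            # placement failures in response["failures"].
            if response["tasks"] and not response["failures"]:
                task_arn = response["tasks"][0]["taskArn"]
                try:
                    # Errors like the ENI provisioning timeout only show up
                    # after launch, as a stopped task, so wait for RUNNING.
                    ecs.get_waiter("tasks_running").wait(
                        cluster=cluster, tasks=[task_arn]
                    )
                    return task_arn
                except WaiterError:
                    print(f"Attempt {attempt}: task stopped before RUNNING")
            else:
                print(f"Attempt {attempt} failed: {response['failures']}")
            time.sleep(2 ** attempt)  # simple exponential backoff
        raise RuntimeError("Task failed to launch after retries")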

Google Cloud Functions - Deployment hangs for 5-10 minutes, then gives error "Deployment failure: Operation interrupted"

I'm getting errors when I try to deploy a Google Cloud Function. The process hangs for about 5-10 minutes and then an error appears:
"Deployment failure:
Operation interrupted."
I tried creating a new test function with nothing in it in two different projects of mine; both time out with that same error.
Anyone experiencing anything similar?
There was an incident related to Cloud Functions and Cloud Build that began at 2019-09-24 13:00 and ended at 2019-09-24 18:15 (all times are US/Pacific).
It should be all good now. Please try to deploy your function again.
In case it still does not work for you, please update your question with more information: minimal reproducible code, dependencies, and a timestamp.
Yes, having the same issue here. I tried to check the status on their dashboard; they mark it as OK, but it's not.

Webhook call failed: URL_REJECTED error in DialogFlow v2 Fulfillments

Error description
Upon calling DialogFlow v2 detectIntent API, we randomly get an internal error with status code 13:
Webhook call failed. Fetch failure with no HTTP status code. Status: State: URL_REJECTED Reason: 67
This error seems to happen randomly. The same request can succeed or fail.
Interestingly, the service has been deteriorating since Friday, 23rd August 2019, to the point of failing on almost every call today.
Our investigation
We didn't find anything at all about URL_REJECTED in relation to DialogFlow or Google on the internet.
But we found the meaning of the status code 13 on this page:
Internal errors. This means that some invariants expected by the underlying system have been broken. This error code is reserved for serious errors.
We also checked that we aren't banning Google IPs and that our load balancing isn't misconfigured (we thought of that since it would make sense given the random failures).
The webhook is up and running, and we can call it ourselves. The problem seems to happen in Google's infra, as the error code 13 seems to show.
(I'm answering immediately because we fixed it before posting the question, but I'm posting it nevertheless because it may be useful to others.)
The problem was that the webhook was called using http.
Setting https solved the problem.
It seems that Google activated a webhook policy on their servers that rejects insecure calls.
It may have been deployed gradually on their cluster, which would explain the gradual degradation.
We know we should have migrated to https a long time ago, but still, we didn't find any mention of this policy being applied anywhere on the net.
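If anyone wants to sanity-check their endpoint over https before flipping the fulfillment URL, a minimal sketch (the URL is a placeholder, and the empty JSON body is only there to confirm TLS reachability, not a valid DialogFlow payload):

    import requests

    # Placeholder; substitute your own fulfillment endpoint.
    WEBHOOK_URL = "https://example.com/dialogflow/fulfillment"

    # A successful request here rules out certificate or reachability
    # problems on the https variant of the webhook.
    resp = requests.post(WEBHOOK_URL, json={}, timeout=10)
    print(resp.status_code)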
Thank you for posting this. I came across the same issue. Changed my webhook to HTTPS seems to fix the problem.

How to debug failed fargate task initialization

I have a Fargate task scheduled to run via CloudWatch Event rules; it writes a timestamp to a database on each successful run. It also outputs a logfile to CloudWatch every time it runs.
However, there was one time where the log file was not created and the database was not updated. I suspect the task never even started, or failed to start.
In CloudWatch, the event rule shows trigger and invocation at the time I expected the task to run, so I assume the task at least attempted to start.
My question is: is there any way I can debug or log information about the cluster failing to start a task?
Please let me know if I need to provide more information.
Edit: I should specify that I'm looking for a way to read this information in a log file somewhere. I know I can see the failed-task reason in the web console, but that's only available for relatively recent tasks.
I have posted the same question on Reddit: https://www.reddit.com/r/aws/comments/adtqvt/debugging_failed_fargate_task_initialization/ and on the AWS forums: https://forums.aws.amazon.com/thread.jspa?messageID=884638&#884638
Go to the cluster and choose the Tasks tab
In the lower pane, choose Stopped for the Desired Task Status value
Locate the desired Task and click its GUID
Scroll down to the Containers section and expand the relevant containers that are experiencing errors
You'll see some kind of Status reason for the error. In my case it was:
CannotStartContainerError: API error (500): failed to initialize logging driver: Cannot determine region for awslogs driver
Edit: I can't really take credit for figuring this out - found it here:
https://github.com/aws/amazon-ecs-agent/issues/1654#issuecomment-437178282
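If you need this in a log rather than the console (which is what the question asks for), the same information is available from the API. A minimal boto3 sketch with a hypothetical cluster name; since stopped tasks are only retained briefly, you'd run this on a schedule or right after a failure:

    import boto3

    ecs = boto3.client("ecs")
    CLUSTER = "my-cluster"  # hypothetical cluster name

    # Stopped tasks include ones that failed to start; their
    # stoppedReason carries errors such as CannotStartContainerError.
    task_arns = ecs.list_tasks(cluster=CLUSTER, desiredStatus="STOPPED")["taskArns"]
    if task_arns:
        for task in ecs.describe_tasks(cluster=CLUSTER, tasks=task_arns)["tasks"]:
            print(task["taskArn"], task.get("stoppedReason"))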
Try going to "CloudWatch -> Logs -> Insights" and clicking "Run Query".
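For example, a generic Logs Insights query that surfaces recent error lines from whatever log group your tasks write to (pick the log group in the UI; the filter pattern is just an illustration):

    fields @timestamp, @message
    | filter @message like /error/
    | sort @timestamp desc
    | limit 20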
I just faced this problem and the lack of logs did make it quite difficult to resolve.
The problem in my case was that the security group used for the task had been deleted. Hope this helps if anyone has a similar issue.

NATS Error while developing echo service

I'm trying to develop a system service, so I'm using the echo service as a test.
I developed the service by following the directions in the CF docs.
Now the echo node is running, but the echo gateway fails with the error "echo_gateway - pid=15040 tid=9321 fid=290e ERROR -- Exiting due to NATS error: Could not connect to server on nats://localhost:4222/"
I got into this issue and was stuck for almost a week before someone finally helped me resolve it. The underlying problem is something else entirely; because errors are not trapped properly, a misleading message is shown. You need to go to GitHub and get the latest code base. The fix for this issue is http://reviews.cloudfoundry.org/#/c/8891. Once you apply that fix, you will most likely encounter a timeout-field issue; the solution for that is to define the timeout field in gateway.yml.
A few additional properties became required in the echo_gateway.yml.erb file: specifically, the latest were default_plan and timeout, under the service group. The properties have been added to the appropriate file in the vcap-services-sample-release repo.
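For reference, a minimal sketch of what that section might look like; the values here are illustrative, not taken from the repo:

    # echo_gateway.yml.erb (excerpt; values are illustrative)
    service:
      default_plan: "free"
      timeout: 15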
It looks like the fix for the misleading error has been merged on GitHub. I haven't updated and verified this myself just yet, but the Gerrit comments indicate the solution is the same one the node base has had for some time. I previously ran into that error handling, and it was far more helpful.