gcloud run deploy: Internal system error, system will retry

Until now I was able to deploy to Cloud Run without trouble, and I was deploying many times a day.
Just now I ran into this error. I tried deleting the service and creating it again; there are no errors in the logs, and it seems to be stuck without actually doing anything.
This also happens for me in some other, unrelated project.
Is there somewhere else I can have a look?

There is an outage announced:
Issue with Cloud Run deployments failing globally starting at
Thursday, 2020-10-29 12:45 US/Pacific.

Related

Dataflow pipeline got stuck

Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h. Please check the worker logs in Stackdriver Logging. You can also get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
I am using a service account with all the required IAM roles.
Generally, "The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h" is caused by worker setup taking too long. To solve this, you can increase worker resources (via the --machine_type parameter).
For example, installing several dependencies that require building wheels (pystan, fbprophet) can take more than an hour on the minimal machine (n1-standard-1, with 1 vCPU and 3.75 GB RAM). Using a more powerful instance (n1-standard-4, which has four times the resources) solves the problem.
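For illustration, a minimal sketch of launching a Beam pipeline on Dataflow with a larger machine type; the script name, project, region, and bucket below are placeholders:

$ python my_pipeline.py \
    --runner DataflowRunner \
    --project my-project \
    --region us-central1 \
    --temp_location gs://my-bucket/temp \
    --machine_type n1-standard-4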
You can debug this by looking at the worker startup logs in cloud logging. You are likely to see pip issues with installing dependencies.
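For example, a filter along these lines in the Logs Explorer should surface them (the job ID is a placeholder, and the exact log name may vary by SDK version):

resource.type="dataflow_step"
resource.labels.job_id="YOUR_JOB_ID"
logName:"worker-startup"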
Do you have any error logs showing that Dataflow Workers are crashing when trying to start?
If not, maybe the worker VMs start but can't reach the Dataflow service, which is often related to network connectivity.
Please note that by default Dataflow creates jobs using the network and subnetwork named default (please check that they exist in your project); you can switch to a specific one with --subnetwork. See https://cloud.google.com/dataflow/docs/guides/specifying-networks for more information.
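As an illustration, the flag takes a full subnetwork URL (project, region, and subnet names below are placeholders):

--subnetwork=https://www.googleapis.com/compute/v1/projects/MY_PROJECT/regions/us-central1/subnetworks/MY_SUBNET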

Cloud Function build error - failed to get OS from config file for image

I'm seeing this Cloud Build error when I try to deploy a Cloud Function:
"Step #2 - "analyzer": [31;1mERROR: [0mfailed to initialize cache: failed to create image cache: accessing cache image "us.gcr.io/MY_PROJECT/gcf/us-central1/SOME_KEY/cache:latest": failed to get OS from config file for image 'us.gcr.io/MY_PROJECT/gcf/us-central1/SOME_KEY/cache:latest'"
I'm able to build and emulate the cloud function locally, but I can't deploy it due to this error. I was able to deploy just fine until now. I've looked everywhere and I can't find any discussion about this. Anyone know what's going on here?
UPDATE: I deployed a new function 3 days ago and now I can't deploy an update to it; I get the same error. I'm fairly sure this is happening due to the lifecycle rule I set up to ensure I don't keep storing images of functions (see: Firebase storage artifacts is huge and keeps increasing). The rule is important to keep around because I don't want to pay for unnecessary storage, but it seems like it might be the source of the problem here. Can someone from Google look into this?
I got the same error, even for code that deployed successfully before.
A workaround is to delete the Docker images for the failing Firebase functions inside Container Registry and redeploy the functions. (The images will be re-created upon deploying.)
The error still occurs sporadically, so I suspect this may be a bug in Firebase's deployment process. Thankfully, for now, the workaround above resolves the issue every time the error comes up.
I also encountered the same problem, and solved it by deleting the images in the Container Registry of Firebase Project.
I made a script for this at the time, and the usage is as follows; a sketch of what such a script might contain is shown after these steps. Please use it if you like.
Install the Google Cloud SDK.
Download the script.
Edit CONTAINER_REGISTRY to your registry name. For example: CONTAINER_REGISTRY=asia.gcr.io/project-name/gcf/asia-northeast1
Grant execute permission. - $ chmod +x script.sh
Execute it. - $ ./script.sh
Deploy your functions.
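The original script isn't reproduced here, but a minimal sketch of what such a cleanup script might look like, assuming gcloud is authenticated and CONTAINER_REGISTRY points at your registry as in the steps above:

#!/bin/bash
# Hypothetical cleanup script: deletes every function image under CONTAINER_REGISTRY.
CONTAINER_REGISTRY=asia.gcr.io/project-name/gcf/asia-northeast1

# List each image repository under the registry path...
for image in $(gcloud container images list --repository="$CONTAINER_REGISTRY" --format='get(name)'); do
  # ...and delete every digest of that image, tags included.
  for digest in $(gcloud container images list-tags "$image" --format='get(digest)'); do
    gcloud container images delete "$image@$digest" --force-delete-tags --quiet
  done
done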
I've been having the same problem for the last few days and have been in contact with support. I had the same log, and in my case it wasn't connected to the artifacts, because the artifacts rebuild themselves automatically on deploy (but read below about a subtle case related to the artifacts and how to fix it). Deleting the functions and redeploying solved it for me.
Artifacts auto cleanup
Note that if the artifacts bucket is empty, then the problem is somewhere else.
But if it's not empty, what you can do to resolve any possible problem related to the artifacts auto cleanup is to delete the whole "containers" folder manually in the artifacts bucket, which should solve it. Then just redeploy.
Make sure not to delete the artifacts bucket itself!
Doug from Firebase confirmed, in the question you're referring to, that removing the artifacts content is safe.
So, here is how to delete it:
Go to the Google Cloud Console and select your project -> Storage -> Browser: https://console.cloud.google.com/storage/browser
Select the "artifacts" bucket
Choose "containers" and delete it
If the problem was here, it should work fine after that.
This happens because the deletion rule you refer to in your question checks the "last updated" timestamp of each file, while on redeploy only some of the files are updated. So the next day the rule deletes some of the files while leaving the others, which leaves the bucket in an inconsistent state. That is why you remove everything manually.
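For what it's worth, the same deletion can be done from the command line; a sketch assuming your artifacts bucket follows the usual us.artifacts.PROJECT_ID.appspot.com naming (verify the exact bucket name in the storage browser first):

$ gsutil -m rm -r "gs://us.artifacts.PROJECT_ID.appspot.com/containers"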

Why is my cloud run deploy hanging at the "Deploying..." step?

Up until today, my deploy process has worked fine. Today, when I go to deploy a new revision, I get stuck at the "Deploying..." text with a spinning indicator, and it says "One or more of the referenced revisions does not yet exist or is deleted." I've tried a number of different images and flags; all the same.
See "Viewing the list of revisions for a service" in order to undo whatever you may have done.
You probably have the wrong project selected if it doesn't recognize any of the revisions.
I know I provided scant information, but just to follow up with an answer: it looks like the issue was that I was deploying a revision and then immediately trying to tag it with gcloud alpha run services update-traffic <service_name> --set-tags, which seems to have caused some sort of race: it complained that the revision was not yet deployed and hung indefinitely. Moving the tag assignment into the gcloud alpha run deploy call itself fixed it.
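To illustrate the difference (service, image, and tag names are placeholders; these commands were alpha at the time):

# Race-prone: deploy a revision, then immediately tag it in a separate call
$ gcloud alpha run deploy my-service --image gcr.io/my-project/my-image
$ gcloud alpha run services update-traffic my-service --set-tags=candidate=LATEST

# Safer: assign the tag as part of the deploy itself
$ gcloud alpha run deploy my-service --image gcr.io/my-project/my-image --tag candidate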

What's the gcloud equivalent of appcfg rollback?

The GCP command appcfg has been deprecated. appcfg had the appcfg rollback command, to be used when there was a failed deployment.
What is its equivalent in gcloud (the new command)? I can't find it in the GCP documentation.
More context:
Rolling back in appcfg was not meant for managing traffic and going back to the previous version; it was used to remove the lock on your deploy.
If you had an unsuccessful deployment, you were not able to deploy any more. appcfg rollback was used to remove that lock and make it possible for you to deploy again.
I think there is no direct gcloud equivalent of appcfg rollback. However, I would highly recommend that you consider the "Splitting the traffic" option.
This will let you redirect the traffic from one of your versions to another, even between old versions of your service.
Let's imagine this:
You have version 1 of your service and it works just fine.
A couple of weeks later you decide to deploy a new version: version 2.
However, the deploy fails and your app is completely down. You are losing users and money. Everything is on fire.
You can easily switch the traffic to the trusty version 1 by redirecting 100% of the traffic to it.
Version 2 is out of the game until you deploy a new version.
The advantage of this is that you don't have to wait for a rollback to finish: the traffic is automatically redirected to the old version. Additionally, there is the gcloud app services set-traffic command so you can do it from the CLI.
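A quick sketch, assuming a service named default with versions v1 and v2:

# Send 100% of traffic back to the trusty version 1
$ gcloud app services set-traffic default --splits v1=1

# Or split traffic between versions, e.g. 90/10
$ gcloud app services set-traffic default --splits v1=0.9,v2=0.1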
Hope this is helpful!

Deploying cloud foundry with BOSH - what does a "bosh delete deployment" clean up?

We've just gone through the process of deploying a multi-node (34 nodes) Cloud Foundry using BOSH, with a few hiccups along the way. One in particular was that it took us several "bosh deploy" runs to get through the initial compilation steps. We'd start the bosh deploy; it would start compiling, get through a few components, and then fail. There is no doubt that we have some configuration issues with our VMware-based infrastructure, and I suspect we are running out of resources. But here is my main question for now.
We were able to get through the compiles by issuing a "bosh delete deployment ourcloud --force" after a failure.
What does this command clear out? It obviously left successfully compiled stuff in place, but what is cleaned? Temporary storage? Anything else?
Thanks.
bosh delete deployment will delete the entire deployment: it deletes the VMs in vCenter, clears its database of the deployment's info, and deletes the manifest. After it's done there should be no trace of the deployment (except logs).