What causes 'Cloud Run error: Internal system error, system will retry later'? Suggestions for troubleshooting? - google-cloud-platform

I'm attempting to deploy a Cloud Run Service as part of tests for my open source project. This is done via our automated CI/CD system and has worked successfully hundreds of times previously.
The Cloud Run Service gets created but the first revision never gets deployed. When I look at the newly created Service in the GCP Console, it shows "Cloud Run error: Internal system error, system will retry later." as the main status message for the Service.
The command line that is failing is:
gcloud --configuration=adapt-cloud-gcloud-testing --quiet run deploy cloud-run-gen-name-a179e65d6fdfc19abc57e15df563d8cb --platform=managed --format=json --no-allow-unauthenticated --memory=128M --cpu=1 --image=gcr.io/adapt-ci/http-echo --region=us-central1 --port=5678 --set-env-vars=ADAPT_TEST_DEPLOY_ID=MockDeploy-aymb --args="-text,Adapt Test"
The output from that command (note: the dots after Creating Revision just keep going):
Deploying container to Cloud Run service [cloud-run-gen-name-a179e65d6fdfc19abc57e15df563d8cb] in project [adapt-ci] region [us-central1]
Deploying new service...
Creating Revision....................................................................................................................
The YAML tab in the Console also shows the same message for each of the three status conditions (see below).
To troubleshoot, I have also tried:
Using the GCP Console to create the most basic Cloud Run Service using the example container from the getting started docs manually, while logged in as the project and organization owner. I see the same failure. I have created Services manually this way previously, with this account and project, with no issues.
Using the GCP Console to create the same example Service as above in a different project, but with the same user and in the same org. This works successfully, so the issue is specific to the project.
Trying two different US regions, with the same results.
Looking for exceeded quotas, since this deployment is normally automated. On the Cloud Run quotas page and the overall quotas page, I don't see any exceeded quotas now or historically. However, this is an area I'm not very familiar with, so I may have missed something.
Retrying dozens of times over the course of two days.
Checking the GCP status page, which shows no outages.
What are additional troubleshooting steps I should take to investigate & fix this issue?
Partial info from the YAML tab in the GCP Console for the failing Service:
status:
  observedGeneration: 1
  conditions:
  - type: Ready
    status: Unknown
    message: 'Cloud Run error: Internal system error, system will retry later.'
    lastTransitionTime: '2020-10-08T21:07:20.844314Z'
  - type: ConfigurationsReady
    status: Unknown
    message: 'Cloud Run error: Internal system error, system will retry later.'
    lastTransitionTime: '2020-10-08T21:07:20.755212Z'
  - type: RoutesReady
    status: Unknown
    message: 'Cloud Run error: Internal system error, system will retry later.'
    lastTransitionTime: '2020-10-08T21:07:20.844314Z'
  latestCreatedRevisionName: cloud-run-gen-name-3bab80f75cfd57cf87ad89d9d2c18ba3-00001-fus
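When a deployment hangs like this in CI, it can help to print the conditions that are not ready instead of reading the YAML tab by hand. The following is only a minimal sketch: it assumes the Service has already been fetched as JSON (for example with gcloud run services describe and --format=json) and loaded into a dict. The function name unready_conditions is hypothetical, and the sample data simply mirrors the status block above.

```python
from typing import Dict, List, Tuple


def unready_conditions(service: Dict) -> List[Tuple[str, str, str]]:
    """Return (type, status, message) for each condition whose status is not 'True'."""
    conditions = service.get("status", {}).get("conditions", [])
    return [
        (c.get("type", ""), c.get("status", ""), c.get("message", ""))
        for c in conditions
        if c.get("status") != "True"
    ]


# Sample data copied from the status block in the question.
MESSAGE = "Cloud Run error: Internal system error, system will retry later."
sample = {
    "status": {
        "observedGeneration": 1,
        "conditions": [
            {"type": "Ready", "status": "Unknown", "message": MESSAGE},
            {"type": "ConfigurationsReady", "status": "Unknown", "message": MESSAGE},
            {"type": "RoutesReady", "status": "Unknown", "message": MESSAGE},
        ],
    }
}

for ctype, status, message in unready_conditions(sample):
    print(f"{ctype}: {status} ({message})")
```

A CI job could fail fast with this output rather than waiting on the endless "Creating Revision" dots.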

After quite a bit of trial and error, I got everything working again.
The first thing I did that made some progress was to disable the Cloud Run Admin API and re-enable it. After that change, I was able to create a service using the example container from the Console, logged in as the project owner. I was also able to create a service using the example container from the CLI, logged in as the CI service account. However, the original command from my question still behaved exactly as before. I have no idea how the project got into a state where even the project owner couldn't use Cloud Run.
The second thing I did was to re-push the container image I was trying to use (gcr.io/adapt-ci/http-echo) to GCR. I pushed the exact same image as was there previously. This finally allowed the CI system to successfully create the Service.
As part of my earlier troubleshooting, I had looked at Google Container Registry for this project and had confirmed that the needed image was still present. However, we had somewhat recently enabled a lifecycle policy on the Cloud Storage bucket to delete items older than a certain age. So my best guess is that the policy deleted some, but not all, of the files associated with the gcr.io/adapt-ci/http-echo image, and that this resulted in the internal error instead of an error saying that the container image couldn't be found.
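One way to test a guess like this is to compare the blobs an image manifest references with the blobs that still exist in storage. The sketch below is an illustration under stated assumptions, not something I actually ran: it assumes the manifest follows the Docker image manifest v2 schema 2 layout (a config digest plus a list of layer digests), and that you have separately collected the set of digests still present in the bucket. The function names and all digests are made up.

```python
from typing import Dict, List, Set


def referenced_digests(manifest: Dict) -> List[str]:
    """Collect the config digest and every layer digest from a v2 schema 2 manifest."""
    digests = []
    config = manifest.get("config", {})
    if "digest" in config:
        digests.append(config["digest"])
    digests.extend(layer["digest"] for layer in manifest.get("layers", []))
    return digests


def missing_blobs(manifest: Dict, present: Set[str]) -> List[str]:
    """Return referenced digests that are absent from the set of stored blobs."""
    return [d for d in referenced_digests(manifest) if d not in present]


# Hypothetical manifest and bucket contents, for illustration only.
sample_manifest = {
    "schemaVersion": 2,
    "config": {"digest": "sha256:aaa"},
    "layers": [{"digest": "sha256:bbb"}, {"digest": "sha256:ccc"}],
}
still_in_bucket = {"sha256:aaa", "sha256:bbb"}

print(missing_blobs(sample_manifest, still_in_bucket))
```

A non-empty result would mean the image is listed but no longer fully pullable, which is consistent with the partial-deletion guess above.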

Related

Dataproc custom image: Cannot complete creation

For a project, I have to create a Dataproc cluster running one of the outdated image versions (for example, 1.3.94-debian10) that contain the Apache Log4j 2 vulnerabilities. The goal is to trigger the related alert (DATAPROC_IMAGE_OUTDATED) in order to check how SCC works (this is only a test environment).
I tried to run the command gcloud dataproc clusters create dataproc-cluster --region=us-east1 --image-version=1.3.94-debian10 but got the following message ERROR: (gcloud.dataproc.clusters.create) INVALID_ARGUMENT: Selected software image version 1.3.94-debian10 is vulnerable to remote code execution due to a log4j vulnerability (CVE-2021-44228) and cannot be used to create new clusters. Please upgrade to image versions >=1.3.95, >=1.4.77, >=1.5.53, or >=2.0.27. For more information, see https://cloud.google.com/dataproc/docs/guides/recreate-cluster, which makes sense, in order to protect the cluster.
I did some research and discovered that I will have to create a custom image with that version and create the cluster from it. The thing is, I have tried to read the documentation and to find a tutorial, but I still can't understand how to get started or how to run the file generate_custom_image.py, for example, since I am not comfortable with Cloud Shell (I prefer the console).
Can someone help? Thank you

Creating Google Cloud Image fails with "Could not fetch resource: Internal error"

I'm trying to set up a private Redash instance with Google Cloud. Step 1 is to add the Redash image to your account so you can boot a VM with it.
When adding the image through Google Cloud Shell, my shell times out before the process completes.
When adding the image through the Console UI, it loads and loads then disappears without a trace.
When adding an image through gcloud CLI, I finally get a response:
➜ gcloud compute images create "redash" --source-uri gs://redash-images/redash.8.0.0-b32245-1.tar.gz
ERROR: (gcloud.compute.images.create) Could not fetch resource:
- Internal error. Please try again or contact Google Support. (Code: '-527xxxxxxxxxx759')
(x = hidden number)
I have extremely slow internet, so I'm thinking this could potentially be the issue. I've contacted Google Support but no response.
I reproduced this by running gcloud compute images create "redash" --source-uri gs://redash-images/redash.8.0.0-b32245-1.tar.gz. The command took a long time for me as well, so I killed it with Ctrl+C. When I then checked Compute Engine > Images, a redash image had been created with the same timestamp as when I ran the command. From this experiment I assume that image creation may continue in the background even if the command is interrupted. I suggest checking the Images section in Compute Engine.
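If you would rather check this from a script than from the Console, the image listing can be inspected programmatically. This is a rough sketch only: it assumes you have the output of gcloud compute images list with --format=json loaded as a list of dicts, each with a name and a status field. The function name image_status and the sample entries are hypothetical.

```python
from typing import Dict, List, Optional


def image_status(images: List[Dict], name: str) -> Optional[str]:
    """Return the status of the image with the given name, or None if it is not listed."""
    for image in images:
        if image.get("name") == name:
            return image.get("status")
    return None


# Made-up listing for illustration; real entries carry many more fields.
sample_images = [
    {"name": "some-other-image", "status": "READY"},
    {"name": "redash", "status": "READY"},
]

print(image_status(sample_images, "redash"))
print(image_status(sample_images, "missing-image"))
```

A result of None would mean the interrupted command did not leave an image behind.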
This issue is caused by the payment status of Redash's GCP account.
If it happens again, I recommend asking the Redash admins to check their GCP account's payment status.
The issue is discussed in the Redash community at the following URL:
https://discuss.redash.io/t/cant-pull-redash-image-on-google-cloud/9486

Error: Asset 'webhooks/ActionsOnGoogleFulfillment' cannot be deployed

I wanted to build a Google Assistant with custom actions using the Actions SDK. Since I am new to this, I followed the steps in the tutorial "Build Actions for Google Assistant using Actions SDK (Level 1)" exactly as written in order to build a sample assistant. However, in step 5 (Implement fulfillment), when trying to test the fulfillment by running the command
gactions deploy preview
I get the following output in the terminal, with an error:
Sending configuration files...
Sending resources...
Waiting for server to respond. It could take up to 1 minute if your cloud function needs to be redeployed.
[ERROR] Server did not return HTTP 200.
{
  "error": {
    "code": 400,
    "message": "Asset 'webhooks/ActionsOnGoogleFulfillment' cannot be deployed. [An operation on function cf-_CcGD8lKs_F_LHmFYfJZsQ-name in region us-central1 in project <my-project-id> is already in progress. Please try again later.]"
  }
}
When I checked "Google Cloud Platform -> Cloud Functions Console" for this project, I saw the following.
[Image 1 (screenshot): Cloud Platform Cloud Functions Console]
There is a failed deployment of a Cloud Function, marked with an exclamation mark. If I delete that function, a new function is immediately deployed automatically, but with a spinning wheel (loading/still deploying) instead of the exclamation mark. I cannot delete the Cloud Function while it is still loading/deploying. After 10-15 minutes, the spinning wheel changes to an exclamation mark. If I delete the function then, a new one automatically appears again, and the cycle repeats.
[Image 2 (screenshot): Cloud Platform Cloud Functions Console]
This problem arises only when implementing a webhook/fulfillment (Step 5). For static Actions responses, running "gactions deploy preview" deploys successfully for testing (Steps 1 to 4 were implemented successfully).
I have followed the tutorial exactly, so the code and directory structure are the same as in the tutorial (only the project ID or Actions console project name differs).
Github Repository for Code
Since this is only for the tutorial, I am not using a billing account at present. Instead, I made the following change in package.json (changed the Node version from 10 to 8):
"engines": {
  "node": "8"
},
Due to this continuous automatic failed deployment, when I try to explicitly deploy the project, as mentioned above, this error occurs.
"An operation on function cf-_CcGD8lKs_F_LHmFYfJZsQ-name in region us-central1 in project <my-project-id> is already in progress. Please try again later".
Can anyone please suggest how to stop this continuous automatic failed deployment of the cloud functions, so that the function I deploy will be successfully deployed? Would really appreciate your help.
(Note: This is the first time I have posted a question in stack overflow, so please let me know if there are any mistakes or stack overflow question conventions I might not have followed. I will improve it.)
Posting this as Community Wiki as it's based on the comments.
As clarified, the issue seems to be the billing account: the tutorial mentions that one must be set for the Cloud Functions to deploy correctly. Beyond that, it's not possible to deploy Cloud Functions (webhooks) without a billing account, so even though you are not using Node.js 10, you will need a billing account configured for your project.
To summarize, a billing account is needed to avoid deployment failures, even if you are not using Node.js 10, as explained in the tutorial you followed.

Can not deploy any revisions (old or first) to Cloud Run

"Cloud Run error: Internal system error."
This error, and only this error, keeps coming up when I try to deploy a revision (new or first).
What is going on with Cloud Run?
Their page won't load from GCP (though I can get in via google search) and I cannot deploy any revision without getting this error
The container works locally
It seems to be a temporary issue in the platform. You can check it on the Google Cloud status page:
We've received a report of an issue with Cloud Run.
Both Cloud Run and Cloud Source Repositories seem to be affected.
Google's team is usually quick to fix whatever happened; you can find more info here as soon as something starts to move :)
Change the Docker image ENTRYPOINT value or set those fields:

Cannot delete deployment from google cloud

I'm trying to create a couple of deployment templates for airflow on GCP / Kubernetes. In that deployment, I seek to deploy all dependent managed services together with some required users and passwords.
I've been able to deploy the services, but it complained about a missing "host" parameter when creating two users. This type is documented here, and it shouldn't really complain, because host is listed as optional:
https://cloud.google.com/sql/docs/mysql/admin-api/v1beta4/users/insert
So I attempted to delete the deployment, but the delete never finishes: it blocks on the two resources, which it can probably never delete now. This is what I get in the console:
$ gcloud deployment-manager deployments delete airflow-on-k8s
The following deployments will be deleted:
- airflow-on-k8s
Do you want to continue (y/N)? y
Waiting for delete [operation-1502140582303-556305bcf9519-0af00aa8-d01c8bf6]...failed.
ERROR: (gcloud.deployment-manager.deployments.delete) Delete operation operation-1502140582303-556305bcf9519-0af00aa8-d01c8bf6 failed.
Error in Operation [operation-1502140582303-556305bcf9519-0af00aa8-d01c8bf6]: errors:
- code: RESOURCE_ERROR
  location: /deployments/airflow-on-k8s/resources/root-user
  message: '{"ResourceType":"sqladmin.v1beta4.user","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"errors":[{"domain":"global","location":"host","locationType":"parameter","message":"Required
    parameter: host","reason":"required"}],"message":"Required parameter: host","statusMessage":"Bad
    Request","requestPath":"https://www.googleapis.com/sql/v1beta4/projects/<...>/instances/airflow-db-instance4/users"}}'
- code: RESOURCE_ERROR
  location: /deployments/airflow-on-k8s/resources/regular-airflow-user
  message: '{"ResourceType":"sqladmin.v1beta4.user","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"errors":[{"domain":"global","location":"host","locationType":"parameter","message":"Required
    parameter: host","reason":"required"}],"message":"Required parameter: host","statusMessage":"Bad
    Request","requestPath":"https://www.googleapis.com/sql/v1beta4/projects/<...>/instances/airflow-db-instance4/users"}}'
This is probably a bug in the API, but if anyone knows of a workaround, let me know. I have also heard that some Googlers hang out on Stack Overflow and could potentially forward this to the API developers.
I had a similar problem deleting my deployment. I ended up deleting the resources by hand, and just abandoned the deployment:
gcloud deployment-manager deployments delete <deployment name> --delete-policy=ABANDON
I haven't seen any bugs reported around this, by the way: https://issuetracker.google.com/issues?q=sqladmin.v1beta4.user%20%22Required%20parameter:%20host%22
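Before abandoning the deployment, it is useful to know exactly which resources are stuck so they can be deleted by hand. The following is a small sketch, assuming the operation's errors are available as a list of dicts with code, location and message keys, where message is a JSON string shaped like the output in the question. The function name stuck_resources is hypothetical, and the sample messages are shortened versions of the ones above.

```python
import json
from typing import Dict, List, Tuple


def stuck_resources(errors: List[Dict]) -> List[Tuple[str, str]]:
    """Return (resource name, reason) pairs extracted from operation errors."""
    results = []
    for err in errors:
        # The resource name is the last path segment of the location.
        name = err["location"].rsplit("/", 1)[-1]
        try:
            detail = json.loads(err["message"])
            reason = detail["ResourceErrorMessage"]["message"]
        except (ValueError, KeyError, TypeError):
            # Fall back to the raw message if it is not the expected JSON shape.
            reason = err["message"]
        results.append((name, reason))
    return results


def make_error(resource: str) -> Dict:
    """Build a shortened sample error mirroring the output in the question."""
    return {
        "code": "RESOURCE_ERROR",
        "location": f"/deployments/airflow-on-k8s/resources/{resource}",
        "message": json.dumps({
            "ResourceType": "sqladmin.v1beta4.user",
            "ResourceErrorCode": "400",
            "ResourceErrorMessage": {"code": 400, "message": "Required parameter: host"},
        }),
    }


sample_errors = [make_error("root-user"), make_error("regular-airflow-user")]

for name, reason in stuck_resources(sample_errors):
    print(f"{name}: {reason}")
```

With the stuck resources identified and removed manually, the ABANDON delete policy shown above drops the deployment record without touching the remaining resources.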