Why is my cloud run deploy hanging at the "Deploying..." step? - google-cloud-platform

Up until today, my deploy process has worked fine. Today when I go to deploy a new revision, I get stuck at the Deploying... text with a spinning indicator, and it says One or more of the referenced revisions does not yet exist or is deleted. I've tried a number of different images and flags -- all the same.

See Viewing the list of revisions for a service to check which revisions actually exist, and to undo whatever you may have done.
If it doesn't recognize any of the revisions, you probably have the wrong project selected.

I know I provided scant information, but just to follow up with an answer: it looks like the issue was that I was deploying a revision and then immediately trying to tag it with gcloud alpha run services update-traffic <service_name> --set-tags, which appears to have caused some sort of race: it complained that the revision was not yet deployed and would hang indefinitely. Moving the tag assignment into the gcloud alpha run deploy command itself seemed to fix it.
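For anyone hitting the same race, here is a rough sketch of the change. The service, image, tag, and revision names (my-service, gcr.io/MY_PROJECT/my-image, candidate, my-service-00042-abc) are placeholders, not values from the question, and the tag flag is --tag on current gcloud releases:

# Racy: tagging immediately after the deploy, before the new revision exists
$ gcloud alpha run deploy my-service --image gcr.io/MY_PROJECT/my-image
$ gcloud alpha run services update-traffic my-service --set-tags=candidate=my-service-00042-abc

# Fix: assign the tag as part of the deploy itself
$ gcloud alpha run deploy my-service --image gcr.io/MY_PROJECT/my-image --tag candidate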

Related

Application information missing in Spinnaker after re-adding GKE accounts - using spinnaker-for-gke

I am using a Spinnaker implementation set up on GCP using the spinnaker-for-gcp tools. My initial setup worked fine. However, we recently had to re-configure our GKE clusters (independently of Spinnaker). Consequently I deleted and re-added our gke-accounts. After doing that the Spinnaker UI appears to show the existing GKE-based applications but if I click on any of them there are no clusters or load balancers listed anymore! Here are the spinnaker-for-gcp commands that I executed:
$ hal config provider kubernetes account delete company-prod-acct
$ hal config provider kubernetes account delete company-dev-acct
$ ./add_gke_account.sh # for gke_company_us-central1_company-prod
$ ./add_gke_account.sh # for gke_company_us-west1-a_company-dev
$ ./push_and_apply.sh
When the above didn't work I did an experiment where I deleted the two accounts and added an account with a different name (but the same GKE cluster) and ran push_and_apply. As before, the output messages seemed to indicate that everything worked, but the Spinnaker UI continued to show all the old account names, despite the fact that I had deleted them and added new ones (which did not show up). And, as before, no details could be seen for any of the applications. Also note that hal config provider kubernetes account list did show the new account name and did not show the old ones.
Any ideas for what I can do, other than complete recreating our Spinnaker installation? Is there anything in particular that I should look for in the Spinnaker logs in GCP to provide more information?
Thanks in advance.
-Mark
The problem turned out to be that the data that was in my .kube/config file in Cloud Shell was obsolete. Removing that file, recreating it (via the appropriate kubectl commands) and then running the commands mentioned in my original description fixed the problem.
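For anyone in the same spot, a minimal sketch of that recreation step. The project, cluster, and location names below are only inferred from the kubeconfig context names in the question, and gcloud's get-credentials is one common way to regenerate the entries (the "appropriate kubectl commands" mentioned above may differ in your setup):

$ rm ~/.kube/config
$ gcloud container clusters get-credentials company-prod --region us-central1 --project company
$ gcloud container clusters get-credentials company-dev --zone us-west1-a --project company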
Note, though, that it took a lot of shell-script and GCP log reading by our team to figure out the problem. Ultimately, it would have been nice if the add_gke_account.sh or push_and_apply.sh scripts could have detected the issue, presumably by verifying that the expected changes did, in fact, correctly occur in the running Spinnaker.

MWAA - environments constantly loading

I'm currently trying to set up an Airflow environment via MWAA. I've gone through the create-environment steps twice, with both attempts ending at the page listing Airflow environments and a banner saying the creation was successful. However, for the past 2 days, that environments page has just shown Loading Environments. I also see a (0) for the environment count.
So far, I've added 2 interfaces for ECR and VPC for the API and the environment but no luck. Has anyone else run into this issue or have any clue what might be happening? Thanks!
Were you able to find a solution to this issue? I had similar problems when I set up MWAA for the first time on an AWS account.
Here's a link to how to verify that all the resources are set up correctly for MWAA:
https://github.com/awslabs/aws-support-tools/tree/master/MWAA
If you run the script mentioned in the repo you should be able to see where the issue lies.
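For reference, a rough sketch of running that verification tooling. The script path and --envname flag below are my recollection of the repository's contents and may have changed, and my-mwaa-environment is a placeholder, so check the repo's README for the current usage:

$ git clone https://github.com/awslabs/aws-support-tools.git
$ cd aws-support-tools/MWAA
# confirm the script location and usage in the README before running
$ python3 verify_env/verify_env.py --envname my-mwaa-environment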

Cloud Function build error - failed to get OS from config file for image

I'm seeing this Cloud Build error when I try to deploy a Cloud Function:
"Step #2 - "analyzer": [31;1mERROR: [0mfailed to initialize cache: failed to create image cache: accessing cache image "us.gcr.io/MY_PROJECT/gcf/us-central1/SOME_KEY/cache:latest": failed to get OS from config file for image 'us.gcr.io/MY_PROJECT/gcf/us-central1/SOME_KEY/cache:latest'"
I'm able to build and emulate the cloud function locally, but I can't deploy it due to this error. I was able to deploy just fine until now. I've looked everywhere and I can't find any discussion about this. Anyone know what's going on here?
UPDATE: I deployed a new function 3 days ago and now I can't seem to deploy an update to it. I get the same error. I'm fairly sure this is happening due to the lifecycle rule I set up to ensure I don't keep storing images of functions: Firebase storage artifacts is huge and keeps increasing. This rule is important to keep around because I don't want to pay for unnecessary storage, but it seems like it might be the source of our problem here. Can someone from Google look into this?
I got the same error, even for code that deployed successfully before.
A workaround is to delete the Docker images for the failing Firebase functions inside Container Registry and redeploy the functions. (The images will be re-created upon deploying.)
The error still occurs sporadically, so I suspect this may be a bug introduced in Firebase's deployment process. Thankfully for now, the workaround above resolves the issue every time the error comes up.
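For reference, a hedged sketch of that deletion from the command line, reusing the placeholder image path from the error message above (the actual key segment differs per function, so list the tags first to confirm you have the right image):

$ gcloud container images list-tags us.gcr.io/MY_PROJECT/gcf/us-central1/SOME_KEY/cache
$ gcloud container images delete us.gcr.io/MY_PROJECT/gcf/us-central1/SOME_KEY/cache:latest --force-delete-tags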
I also encountered the same problem, and solved it by deleting the images in the Container Registry of the Firebase project.
I wrote a script for this at the time, and I'll put it here. The usage is as follows (see the illustrative sketch after these steps); please use it if you like.
1. Install the Google Cloud SDK.
2. Download the script.
3. Edit CONTAINER_REGISTRY to your registry name. For example: CONTAINER_REGISTRY=asia.gcr.io/project-name/gcf/asia-northeast1
4. Grant execute permission: $ chmod +x script.sh
5. Execute it: $ sh script.sh
6. Deploy your functions.
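The original script is not reproduced here; below is a purely illustrative sketch (not the author's script) of what such a bulk cleanup can look like, assuming gcloud is authenticated and the nested REGISTRY/<function-key>/cache layout seen in the error message above. Run it with bash and edit CONTAINER_REGISTRY first:

#!/bin/bash
# Illustrative sketch only: delete every GCF image (and all its tags) under one registry path.
set -e

CONTAINER_REGISTRY=asia.gcr.io/project-name/gcf/asia-northeast1   # edit this

# Walk the two nested repository levels (function key, then image name),
# then delete each image digest together with all of its tags.
for key_repo in $(gcloud container images list --repository="$CONTAINER_REGISTRY" --format='value(name)'); do
  for image in $(gcloud container images list --repository="$key_repo" --format='value(name)'); do
    for digest in $(gcloud container images list-tags "$image" --format='get(digest)'); do
      gcloud container images delete "$image@$digest" --force-delete-tags --quiet
    done
  done
done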
I've been having the same problem for the last few days and am in contact with support. I had the same log, and in my case it wasn't connected to the artifacts, because the artifacts rebuild themselves automatically on deploy (read below about a subtle case related to the artifacts and how to fix it), but deleting the functions and redeploying solved it for me.
Artifacts auto cleanup
Note that if the artifacts bucket is empty, then the problem is somewhere else.
But if it's not empty, what you can do to resolve any possible problems related to the artifacts auto cleanup is to manually delete the whole "container" folder in the artifacts bucket, which should solve it. Then just redeploy.
Make sure not to delete the artifacts bucket itself!
Doug from Firebase confirmed in the question you're referring to that removing the artifacts content is safe.
So, here is how to delete it:
1. Go to the Google Cloud console, select your project -> Storage -> Browser: https://console.cloud.google.com/storage/browser
2. Select the "artifacts" bucket.
3. Choose "containers" and delete it.
If the problem was here, it should work fine after that.
This happens because the deletion rule you refer to in your question checks the "last updated" timestamp of each file, while on redeploy only some of the files are updated. So the next day the rule deletes some of the files while leaving the others, which leads to an inconsistent state of the bucket. That's why you remove everything manually.
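The same deletion can also be done from the command line. A hedged sketch, assuming the default artifacts bucket naming for functions in the US multi-region (MY_PROJECT is a placeholder and your bucket name may differ, so list it first):

$ gsutil ls gs://us.artifacts.MY_PROJECT.appspot.com/
$ gsutil -m rm -r gs://us.artifacts.MY_PROJECT.appspot.com/containers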

Amplify Fetching is too long

aws-cli/2.1.21 Python/3.7.4 Darwin/19.6.0
amplify CLI 4.41.2
macOS 10.15.6
question:
The Amplify fetching process is taking too long.
Please help me.
Here is what I tried:
amplify init
amplify add api -> REST API
amplify push
Fetching started on the last command.
The console shows:
'Fetching updates to backend environment: dev from the cloud.'
I waited an hour, but the process did not complete.
Please tell me what I should check.
other
Previously, I manually deleted the Amplify-related resources
(e.g. CloudFormation, S3, and more).
This may have been a bad move.
What is strange about your situation is that you are pushing changes, but I'm not certain you mentioned having a stable cloud version. Meaning this: if there was an unstable version in the cloud and a heavily modified local version, it could cause issues (which is what I have seen when working with it). My advice would be to ensure you have a cleanly designed cloud version with Amplify Studio; https://docs.amplify.aws/cli/ can help with general commands, though "amplify --help" is another option.
Aside from that, "amplify pull" will overwrite your local project, but assuming it is a clean version you could then push it (see the sketch below). I have actually saved files off to my desktop and then copied them back in, and that has worked for me. Fundamentally, the issue is that it is a cloud system you are relying on; major modifications will often be ignored or overwritten. Wish I could help more.
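For what it's worth, a minimal sketch of that pull-then-push flow, run from the Amplify project root (the backend environment name dev is taken from the question):

$ amplify pull      # overwrite local state with the current cloud backend (env: dev)
$ amplify status    # review what would change before pushing
$ amplify push      # push local changes back once the two sides look consistent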

Cloud Composer throwing InvalidToken after adding another node

I recently added a few new DAGs to production airflow and as a result decided to scale up the number of nodes in the Composer pool. After doing so I got the error: Can't decrypt _val for key=<KEY>, invalid token or value. This happens now for every single DAG that uses variables. It's not the same key either, it depends on what variables the DAG needs.
I immediately scaled Composer back down to 3 nodes and the problem persisted.
I have tried re-saving all of the Variables, recreating them in the UI (which says they are all valid), and recreating them via the CLI (which lists every single one as invalid).
I have also tried updating the configuration to force a reboot of the server, and manually stopping the VM instances.
Composer also doesn't seem to allow updating the Fernet key, so I can't try using a new one. For some reason it appears that the permanent one Composer has assigned is now invalid.
Is there anything else that can be done to remedy that problem short of recreating the environment?
I managed to fix this problem by adding a new python package. It seems that adding a package is the only way to really "reboot" the environment. The reboot invalidated all of my variables and connections when it had finished but I was able to just add those back in rather than having to recreate the entire environment.
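For reference, a hedged example of that kind of "reboot by adding a package". The environment name, location, and package are placeholders; any package addition or version change should force Composer to rebuild the environment, and the flag name is --update-pypi-package on current gcloud releases (check gcloud composer environments update --help if it has changed):

$ gcloud composer environments update my-composer-env \
    --location us-central1 \
    --update-pypi-package="six>=1.12.0"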
Heard back about this issue: According to Google, Composer creates a custom image for the environment and passes one to each node, and if that got corrupted during scaling then the only way to fix it is by adding a new python package so it rebuilds the image. Incidentally, version 1.3.0 of Composer is much better as the scheduler is restarted every 10 minutes which should solve some of the latter issues I experienced.