Kubernetes Engine unable to pull image from non-private / GCR repository

I was happily deploying to Kubernetes Engine for a while, but while working on an integrated Cloud Container Builder pipeline, I started getting into trouble.
I don't know what changed. I cannot deploy to Kubernetes anymore, even in ways that worked before without Container Builder.
The pod rollout process gives an error indicating that it is unable to pull from the registry, which seems weird because the images exist (I can pull them using the CLI) and I granted every possibly related permission to my user and to the Container Builder service account.
I get the error ImagePullBackOff and see this in the pod events:
Failed to pull image
"gcr.io/my-project/backend:f4711979-eaab-4de1-afd8-d2e37eaeb988":
rpc error: code = Unknown desc = unauthorized: authentication required
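(For reference, these events come from describing the failing pod; the pod name below is just a placeholder:)
kubectl get pods
kubectl describe pod backend-5f4b9d7c9-abcde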
What's going on? Who needs authorization, and for what?

In my case, my cluster didn't have the Storage read permission, which is necessary for GKE to pull an image from GCR.
My cluster didn't have the proper permissions because I created it through Terraform and didn't include the node_config.oauth_scopes block. When creating a cluster through the Cloud Console, the Storage read scope is added by default.
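If you want to verify this on an existing cluster, something like the following should work (the cluster, node pool and zone names are placeholders; in Terraform the equivalent is adding https://www.googleapis.com/auth/devstorage.read_only to node_config.oauth_scopes):
# Check which OAuth scopes the existing node pool has
gcloud container node-pools describe default-pool --cluster my-cluster --zone europe-west1-b --format="value(config.oauthScopes)"
# Create a replacement node pool that includes the storage read scope
gcloud container node-pools create pool-with-gcr --cluster my-cluster --zone europe-west1-b --scopes=https://www.googleapis.com/auth/devstorage.read_only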

The credentials in my project somehow got messed up. I solved the problem by re-initializing a few APIs including Kubernetes Engine, Deployment Manager and Container Builder.
The first time I tried this I didn't succeed, because to disable an API you first have to disable all the APIs that depend on it. If you do this via the GCP web console, you'll likely see a list of dependent services that cannot all be disabled from the UI.
I learned that using the gcloud CLI you can list all the APIs enabled in your project and disable everything properly.
Things worked after that.
The reason I knew things were messed up is that I had a copy of the same setup as a production environment, and there these problems did not exist. The development environment had gone through a lot of iterations and messing around with credentials, so somewhere along the way things got corrupted.
These are some examples of useful commands:
gcloud projects get-iam-policy $PROJECT_ID
gcloud services disable container.googleapis.com --verbosity=debug
gcloud services enable container.googleapis.com
More info here, including how to restore service account credentials.
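For completeness, listing what is currently enabled before disabling anything can be done like this (a standard services command, nothing project-specific assumed):
# Show every API currently enabled in the active project
gcloud services list --enabled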

Atlantis plan erroring with querying Cloud Storage failed message

I have a GCP VM to which a GCP Service Account has been attached.
This SA has the appropriate permissions to perform some terraform / terragrunt related actions, such as querying the backend configuration GCS bucket etc.
So, when I log in to the VM (to which I have already transferred my terraform configuration files), I can for example do
$ terragrunt plan
Initializing the backend...
Successfully configured the backend "gcs"! Terraform will automatically
use this backend unless the backend configuration changes.
Initializing provider plugins...
- terraform.io/builtin/terraform is built in to Terraform
- Finding hashicorp/random versions matching "3.1.0"...
- Finding hashicorp/template versions matching "2.2.0"...
- Finding hashicorp/local versions matching "2.1.0"...
.
.
.
(...and the plan goes on)
I have now set up Atlantis to run as a systemd service (under a user with the same name).
The problem is that when I create a PR, the plan (as posted as a PR comment) fails as follows:
Initializing the backend...
Successfully configured the backend "gcs"! Terraform will automatically
use this backend unless the backend configuration changes.
Failed to get existing workspaces: querying Cloud Storage failed: storage: bucket doesn't exist
Does anyone know (or suspect) whether this problem may be related to the attached service account not being usable by the systemd service running Atlantis? (Because the bucket is there, since I am able to plan manually.)
Update: I have validated that a systemd service does inherit the GCP SA by creating a systemd service that just runs this script
#!/bin/bash
gcloud auth list
and this does output the SA of the VM.
So I changed my original question since this apparently is not the issue.
Posting my comment as an answer for visibility to other community members.
You may be getting the error because of an issue with the Terraform backend configuration. To reinitialize it, please run the following command and see if it solves your issue.
terraform init -reconfigure
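If that alone doesn't help, a quick sanity check is to confirm, from the same user/environment that runs the Atlantis service, that the state bucket is visible to those credentials (the bucket name below is only a placeholder):
# Run as the Atlantis user to confirm its credentials can see the state bucket
gsutil ls -b gs://my-terraform-state-bucket
# Then re-run the reconfigure through terragrunt
terragrunt init -reconfigure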

We have discouraged Basic authentication in Google Kubernetes Engine (GKE)

I recently received an email from Google:
Hello Google Kubernetes Engine Customer,
We’re writing to remind you that we have discouraged Basic authentication in
Google Kubernetes Engine (GKE). This authentication strategy has been
disabled by default since version 1.12, because it does not align with
Googles’ security best practices and will no longer be supported in GKE,
starting from v1.19.
You’re receiving this message, because you’re currently using a
static password to authenticate for one or more of your GKE clusters.
How can I avoid using a static password, and where is this kept? I don't remember setting this up.
I've referenced https://cloud.google.com/kubernetes-engine/docs/how-to/hardening-your-cluster. Am I right to understand that I didn't do anything in particular to fall out of compliance, other than using the GCP automation prior to 1.12, and that I now need to take some action to stay within current standards?
I want to make sure I understand the history and scope of this change, and ideally have a simple video or set of commands I can follow verbatim, so that I don't end up with downtime I can't get out of and can keep my current workflow, authenticating with the user that already has access prior to 1.12 when I deploy my app.
Disabling basic authentication should not result in any downtime for your cluster.
The preferred method for authentication with the API server is OAuth and this should already be enabled for your cluster. You can check that this is working by running the following commands:
gcloud auth login
gcloud container clusters get-credentials $CLUSTERNAME --zone $ZONE
Running any kubectl command, e.g. kubectl cluster-info
Assuming all goes well there (and I can't think of any reason it will not), you'd then run
gcloud container clusters update CLUSTER_NAME --no-enable-basic-auth
in order to disable basic auth.
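If you'd like to confirm whether a static password is actually set before changing anything, one way (assuming the usual describe output) is:
gcloud container clusters describe $CLUSTERNAME --zone $ZONE --format="value(masterAuth.username)"
Empty output here suggests basic auth is already effectively disabled.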
EDIT: If you need to (re)enable basic auth, you can run the following command:
gcloud container clusters update CLUSTER_NAME --username=$USER --password=$PASS
where $USER and $PASS are the username and password you were previously using (or a new user/password if you choose).
Of course, if you have any automations using the Cloud SDK which use basic auth, you'd need to update those as well.

gcloud builds submit fails while docker push + gcloud run deploy work just fine?

EDIT: The so-called duplicate question was way off, since 1. I could push other images and 2. I could not push the build image. Finally, 3. the solution was totally different and ONLY related to pushing build images via Cloud Build. i.e. I beg to differ: this question WAS different.
Running into some more Google Cloud security stuff. We currently deploy to Cloud Run like so:
docker build . --tag gcr.io/myproject/authservice
docker push gcr.io/myproject/authservice
gcloud run deploy staging-admin --region us-west1 --image gcr.io/myproject/authservice --platform managed
I followed the quickstart for Cloud Build (https://cloud.google.com/cloud-build/docs/quickstart-build) but I am getting permission errors. The command I ran was
gcloud builds submit --tag gcr.io/myproject/quickstart-image
This is all in the same project, but submitting builds gets this same error over and over and over (I am not sure why it doesn't just exit on the first error):
The push refers to repository [gcr.io/myproject/quickstart-image]
e3831abe9997: Preparing
60664c29ef5a: Preparing
denied: Token exchange failed for project 'myproject'. Caller does not have permission 'storage.buckets.get'. To configure permissions, follow instructions at: https://cloud.google.com/container-registry/docs/access-control
Any ideas how to fix so I can use google cloud build?
Complementing the previous answer: as mentioned in this document, the "Storage Admin" role is necessary to perform actions in Container Registry.
Do you have the "roles/storage.admin" role? If not, add it and try.
The Cloud Build service account has this format: [PROJECT_NUMBER]@cloudbuild.gserviceaccount.com. Please add the "roles/storage.admin" role by following these steps:
Open the Cloud IAM page.
Select your Cloud project.
In the permissions table, locate the row with the email address ending with @cloudbuild.gserviceaccount.com. This is your Cloud Build service account.
Click on the pencil icon.
Select the role you wish to grant to the Cloud Build service account.
Click Save.
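If you prefer the CLI over the console, the same binding can be added with gcloud (PROJECT_ID below is a placeholder):
# Look up the project number used in the Cloud Build service account's email
PROJECT_NUMBER=$(gcloud projects describe PROJECT_ID --format="value(projectNumber)")
# Grant the Storage Admin role to the Cloud Build service account
gcloud projects add-iam-policy-binding PROJECT_ID --member="serviceAccount:${PROJECT_NUMBER}@cloudbuild.gserviceaccount.com" --role="roles/storage.admin"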
BE WARNED: I read the duplicate question post, but in my case
I can push images;
only the build one is failing, AND the solution I found is different from any of the other question's answers.
This was a VERY weird issue. The storage permission must be a red herring, because these permissions fixed the issue.
I found some documentation somewhere on a Google GitHub repo (that I can't seem to find again) about adding these permissions, AND a document on the TWO @cloudbuild.gserviceaccount.com accounts, AND you must add the permissions to the correct one!!!! One is owned by Google and you should not touch it.
In my case, the permission / token exchange failed error was caused by having the storage bucket used by Google Container Registry inside a VPC Service Perimeter.
This can be checked / confirmed via the VPC Service Controls logs - accessible easily from the troubleshooting page.
There is a (very clunky) way to get Cloud Build working to push images to a registry inside a VPC perimeter. It involves running a build worker pool and applying appropriate config + permissions to the perimeter etc.
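As a rough sketch of what that looks like (the pool name, region and network below are placeholders, and the perimeter's ingress/egress rules still need to allow the pool), the build is then pointed at the pool via the --worker-pool flag on gcloud builds submit or the pool option in the build config:
# Create a private worker pool peered with the VPC protected by the perimeter
gcloud builds worker-pools create my-private-pool --region=us-west1 --peered-network=projects/myproject/global/networks/my-vpc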

Is VPC-native GKE cluster production ready?

This happens while trying to create a VPC-native GKE cluster. Per the documentation here the command to do this is
gcloud container clusters create [CLUSTER_NAME] --enable-ip-alias
However, this command gives the error below.
ERROR: (gcloud.container.clusters.create) Only alpha clusters (--enable_kubernetes_alpha) can use --enable-ip-alias
The command does work when the --enable_kubernetes_alpha option is added, but gives another message.
This will create a cluster with all Kubernetes Alpha features enabled.
- This cluster will not be covered by the Container Engine SLA and
should not be used for production workloads.
- You will not be able to upgrade the master or nodes.
- The cluster will be deleted after 30 days.
Edit: The test was done in zone asia-south1-c
My questions are:
Is VPC-Native cluster production ready?
If yes, what is the correct way to create a production ready cluster?
If VPC-Native cluster is not production ready, what is the way to connect privately from a GKE cluster to another GCP service (like Cloud SQL)?
Your command seems correct. It seems like something is going wrong during the creation of the cluster in your project. Are you using any flags other than the ones in the command you posted?
When I set my Google Cloud Shell to region europe-west1,
the cluster deploys error-free and 1.11.6-gke.2 (the default) is what it uses.
You could try to manually create the cluster using the GUI instead of the gcloud command. While creating the cluster, check the "Enable VPC-native (using alias IP)" feature. Try using the newest non-alpha version of GKE if one shows up for you.
The public documentation you posted on GKE IP-aliasing and the GKE projects.locations.clusters API shows this to be GA. All signs point to this being production ready. For whatever it's worth, the feature was announced last May on the Google Cloud blog.
What you can try is to update your version of Google Cloud SDK. This will bring everything up to the latest release and remove alpha messages for features that are in GA right now.
$ gcloud components update
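After the update, the original command should work without the alpha flag; as an illustrative example (zone and version taken from this thread, adjust as needed):
gcloud container clusters create my-cluster --zone asia-south1-c --enable-ip-alias --cluster-version=1.11.6-gke.2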

"gcloud container clusters create" command throws "error Required 'compute.networks.get'"

I want to create GKE clusters by gcloud command. But I cannot solve this error:
$ gcloud container clusters create myproject --machine-type=n1-standard-1 --zone=asia-northeast1-a
ERROR: (gcloud.container.clusters.create) ResponseError: code=403, message=Google
Compute Engine: Required 'compute.networks.get' permission for
'projects/myproject/global/networks/default'
The cloud account linked to my Gmail is the owner of the project (with the related permissions), so I anticipated that there would be no problem with permissions.
When you create a cluster through the $ gcloud container clusters create command, you should keep in mind that there are hundreds of hidden operations.
When you have owner rights, you are able to give the initial "kick" to the process to make everything start. At this point, service accounts enter the process and take care of creating all the resources for you, automatically.
These service accounts have different powers and permissions (which can be customised) in order to limit the attack surface in case one of them is compromised, and to keep a sort of order; you will have, for example, ****-compute@developer.gserviceaccount.com, which is the default Compute Engine service account.
When you enable the different APIs, some of these service accounts are created in order to make the components work as expected, but if one of them is deleted or modified you might face errors like the one you are experiencing.
Usually the easiest way to solve the issue is to recreate the service account, for example by deleting it and then disabling and re-enabling the corresponding API.
For example, when you enable Kubernetes Engine, the service-****@container-engine-robot.iam.gserviceaccount.com service account is created.
In my test project, for example, I modified them by removing the "Kubernetes Engine Service Agent" role and also modified the Google APIs service account, setting it to "Project Viewer", and I now face permission issues both when creating and when deleting clusters.
You can navigate to IAM & Admin --> IAM to check the status and see which service accounts are currently authorised in your project.
Here you can find a deeper explanation of some default service accounts.
Here you can find a small guide regarding how to re-enable Kubernetes Engine's default service account:
"If you remove this role binding from the service account, the default service account becomes unbound from the project, which can prevent you from deploying applications and performing other cluster operations."