My cluster upgrade failed:
gcloud beta container operations describe "operation-sdfsdfsdf" --zone us-central1
clusterConditions:
- canonicalCode: UNKNOWN
  message: DeployPatch failed
detail: DeployPatch failed
endTime: '2022-06-30T12:36:48.246662261Z'
error:
  code: 2
  message: DeployPatch failed
name: operation-sdfsdfsdf
operationType: UPGRADE_NODES
progress:
  metrics:
  - intValue: '7'
    name: NODES_TOTAL
  - intValue: '1'
    name: NODES_FAILED
  - intValue: '6'
    name: NODES_COMPLETE
  - intValue: '7'
    name: NODES_DONE
  - intValue: '2454'
    name: NODE_PDB_DELAY_SECONDS
selfLink: https://container.googleapis.com/v1beta1/projects/xxxxxx/locations/us-central1/operations/operation-sdfsdfsdf
startTime: '2022-06-30T10:36:14.709547456Z'
status: DONE
statusMessage: DeployPatch failed
targetLink: https://container.googleapis.com/v1beta1/projects/xxxxxx/locations/us-central1/clusters/mycluster/nodePools/mypool
zone: us-central1
This is the only thing I've been able to find with this same error: Fail to enable Workload identity on GKE
I searched GitHub as well. This error is nowhere in Google's docs.
I also see zero error logs on the nodes.
I'm following this AWS documentation, which explains how to configure AWS Secrets Manager so that it works with EKS through Kubernetes Secrets.
I followed all the commands step by step, exactly as explained in the documentation.
The only difference I get is related to this step where I have to run:
kubectl get po --namespace=kube-system
The expected output should be:
csi-secrets-store-qp9r8 3/3 Running 0 4m
csi-secrets-store-zrjt2 3/3 Running 0 4m
but instead I get:
csi-secrets-store-provider-aws-lxxcz 1/1 Running 0 5d17h
csi-secrets-store-provider-aws-rhnc6 1/1 Running 0 5d17h
csi-secrets-store-secrets-store-csi-driver-ml6jf 3/3 Running 0 5d18h
csi-secrets-store-secrets-store-csi-driver-r5cbk 3/3 Running 0 5d18h
As you can see the names are different, but I'm quite sure it's ok :-)
The real problem starts here in step 4: I created the following YAML file (as you can see, I added some parameters):
apiVersion: secrets-store.csi.x-k8s.io/v1alpha1
kind: SecretProviderClass
metadata:
  name: aws-secrets
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "mysecret"
        objectType: "secretsmanager"
Finally, I created a deployment (as explained here in step 5) using the following YAML file:
# test-deployment.yaml
kind: Pod
apiVersion: v1
metadata:
  name: nginx-secrets-store-inline
spec:
  serviceAccountName: iamserviceaccountforkeyvaultsecretmanagerresearch
  containers:
  - image: nginx
    name: nginx
    volumeMounts:
    - name: mysecret-volume
      mountPath: "/mnt/secrets-store"
      readOnly: true
  volumes:
  - name: mysecret-volume
    csi:
      driver: secrets-store.csi.k8s.io
      readOnly: true
      volumeAttributes:
        secretProviderClass: "aws-secrets"
After the deployment through the command:
kubectl apply -f test-deployment.yaml -n mynamespace
The pod is not able to start properly because the following error is generated:
Error from server (BadRequest): container "nginx" in pod "nginx-secrets-store-inline" is waiting to start: ContainerCreating
But if, for example, I run the deployment with the following YAML, the pod is created successfully:
# test-deployment.yaml
kind: Pod
apiVersion: v1
metadata:
  name: nginx-secrets-store-inline
spec:
  serviceAccountName: iamserviceaccountforkeyvaultsecretmanagerresearch
  containers:
  - image: nginx
    name: nginx
    volumeMounts:
    - name: keyvault-credential-volume
      mountPath: "/mnt/secrets-store"
      readOnly: true
  volumes:
  - name: keyvault-credential-volume
    emptyDir: {} # <<== !! LOOK HERE !!
as you can see I used
emptyDir: {}
So as far as I can see, the problem is related to the following YAML lines:
csi:
  driver: secrets-store.csi.k8s.io
  readOnly: true
  volumeAttributes:
    secretProviderClass: "aws-secrets"
To be honest, it's not even clear to me what's happening here.
Maybe I didn't properly enable the volume permission in EKS?
Sorry, but I'm a newbie in both AWS and Kubernetes configuration.
Thanks for your time
--- NEW INFO ---
If I run
kubectl describe pod nginx-secrets-store-inline -n mynamespace
where nginx-secrets-store-inline is the name of the pod, I get the following output:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 30s default-scheduler Successfully assigned mynamespace/nginx-secrets-store-inline to ip-10-0-24-252.eu-central-1.compute.internal
Warning FailedMount 14s (x6 over 29s) kubelet MountVolume.SetUp failed for volume "keyvault-credential-volume" : rpc error: code = Unknown desc = failed to get secretproviderclass mynamespace/aws-secrets, error: SecretProviderClass.secrets-store.csi.x-k8s.io "aws-secrets" not found
Any hints?
Finally I realized why it wasn't working. As explained here, the error:
Warning FailedMount 3s (x4 over 6s) kubelet, kind-control-plane MountVolume.SetUp failed for volume "secrets-store-inline" : rpc error: code = Unknown desc = failed to get secretproviderclass default/azure, error: secretproviderclasses.secrets-store.csi.x-k8s.io "azure" not found
is related to namespace:
The SecretProviderClass being referenced in the volumeMount needs to exist in the same namespace as the application pod.
So both YAML files should be deployed to the same namespace (for example, by adding the -n mynamespace argument to each kubectl apply).
Finally I got it working!
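To make the namespace requirement concrete, here is a minimal sketch of the SecretProviderClass from the question with the namespace set explicitly in its metadata. The namespace line is the only addition. It assumes the pod is deployed to mynamespace, as in the question, and has the same effect as passing -n mynamespace on the command line.

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1alpha1
kind: SecretProviderClass
metadata:
  name: aws-secrets
  # Added line: must match the namespace the pod is deployed to.
  namespace: mynamespace
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "mysecret"
        objectType: "secretsmanager"
```

With the namespace written into the manifest, the object lands in mynamespace even if the -n flag is forgotten, which should avoid the "not found" mount error shown above.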
I am experimenting with Deployment Manager, and each time I try to deploy a SQL instance with a database and 2 users on it, some of the tasks fail. Most of the time it is the users that fail:
conf.yaml:
resources:
- name: mycloudsql
  type: gcp-types/sqladmin-v1beta4:instances
  properties:
    name: mycloudsql-01
    backendType: SECOND_GEN
    instanceType: CLOUD_SQL_INSTANCE
    databaseVersion: MYSQL_5_7
    region: europe-west6
    settings:
      tier: db-f1-micro
      locationPreference:
        zone: europe-west6-a
      activationPolicy: ALWAYS
      dataDiskSizeGb: 10
- name: mydjangodb
  type: gcp-types/sqladmin-v1beta4:databases
  properties:
    name: django-db-01
    instance: $(ref.mycloudsql.name)
    charset: utf8
- name: sqlroot
  type: gcp-types/sqladmin-v1beta4:users
  properties:
    name: root
    host: "%"
    instance: $(ref.mycloudsql.name)
    password: root
- name: sqluser
  type: gcp-types/sqladmin-v1beta4:users
  properties:
    name: user
    instance: $(ref.mycloudsql.name)
    password: user
Error:
PS C:\Users\user\Desktop\Python\GCP> gcloud --project=sound-catalyst-263911 deployment-manager deployments create dm-sql-test-11 --config conf.yaml
The fingerprint of the deployment is TZ_wYom9Q64Hno6X0bpv9g==
Waiting for create [operation-1589869946223-5a5fa71623bc9-1912fcb9-bc59aafc]...failed.
ERROR: (gcloud.deployment-manager.deployments.create) Error in Operation [operation-1589869946223-5a5fa71623bc9-1912fcb9-bc59aafc]: errors:
- code: RESOURCE_ERROR
location: /deployments/dm-sql-test-11/resources/sqluser
message: '{"ResourceType":"gcp-types/sqladmin-v1beta4:users","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"message":"Precondition
check failed.","status":"FAILED_PRECONDITION","statusMessage":"Bad Request","requestPath":"https://www.googleapis.com/sql/v1beta4/projects/sound-catalyst-263911/instances/mycloudsql-01/users","httpMethod":"POST"}}'
- code: RESOURCE_ERROR
location: /deployments/dm-sql-test-11/resources/sqlroot
message: '{"ResourceType":"gcp-types/sqladmin-v1beta4:users","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"message":"Precondition
check failed.","status":"FAILED_PRECONDITION","statusMessage":"Bad Request","requestPath":"https://www.googleapis.com/sql/v1beta4/projects/sound-catalyst-263911/instances/mycloudsql-01/users","httpMethod":"POST"}}'
It doesn't say which precondition is failing, or am I missing something?
It seems the database installation has not completed by the time Deployment Manager starts creating the users, even though the reference notation used in the YAML is supposed to take care of dependencies. That is why you receive the "FAILED_PRECONDITION" error.
As a workaround, you can split the deployment into two parts:
1. Create the Cloud SQL instance and the database.
2. Create the users.
This does not look elegant, but it works.
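The two-part split could look roughly like this. The first deployment keeps only the mycloudsql and mydjangodb resources from the question's conf.yaml; the second, sketched below, contains just the users. The file name users.yaml is hypothetical. Because $(ref.mycloudsql.name) only resolves inside the deployment that defines mycloudsql, the sketch writes the instance name mycloudsql-01 out literally. The dependsOn entry is an extra precaution and an assumption on my part, not part of the workaround above; it makes Deployment Manager create the two users one after the other.

```yaml
# users.yaml (hypothetical file name): deploy only after the
# instance-and-database deployment has finished.
resources:
- name: sqlroot
  type: gcp-types/sqladmin-v1beta4:users
  properties:
    name: root
    host: "%"
    # Literal instance name, since references do not cross deployments.
    instance: mycloudsql-01
    password: root
- name: sqluser
  type: gcp-types/sqladmin-v1beta4:users
  metadata:
    # Assumption: wait for sqlroot so the two requests do not hit
    # the instance at the same time.
    dependsOn:
    - sqlroot
  properties:
    name: user
    instance: mycloudsql-01
    password: user
```

Each part is then created with the same gcloud deployment-manager deployments create command used in the question, once per config file, waiting for the first deployment to complete before starting the second.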
Alternatively, you can consider using Terraform. Conveniently, Cloud Shell comes with Terraform pre-installed. There is sample Terraform code for Cloud SQL available, for example:
CloudSQL deployment with Terraform
I am creating a YAML config to deploy a GKE cluster with multiple node pools. I would like to create a new cluster and put each node pool in a different subnetwork. Can this be done?
I have tried putting the subnetwork in different parts of the properties under the second node pool, but it errors out. The error is below.
message: '{"ResourceType":"gcp-types/container-v1:projects.locations.clusters.nodePools","ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"message":"Invalid
JSON payload received. Unknown name \"subnetwork\": Cannot find field.","status":"INVALID_ARGUMENT","details":[{"#type":"type.googleapis.com/google.rpc.BadRequest","fieldViolations":[{"description":"Invalid
JSON payload received. Unknown name \"subnetwork\": Cannot find field."}]}],"statusMessage":"Bad
Here is the current code for both node pools. The first node pool is created, but the second one errors out.
resources:
- name: myclus
  type: gcp-types/container-v1:projects.locations.clusters
  properties:
    parent: projects/[PROJECT_ID]/locations/[ZONE/REGION]
    cluster:
      name: my-clus
      zone: us-east4
      subnetwork: dev-web ### leave this field blank if using the default network
      initialClusterVersion: "1.13"
      nodePools:
      - name: my-clus-pool1
        initialNodeCount: 1
        config:
          machineType: n1-standard-1
          imageType: cos
          oauthScopes:
          - https://www.googleapis.com/auth/cloud-platform
          preemptible: true
- name: my-clus
  type: gcp-types/container-v1:projects.locations.clusters.nodePools
  properties:
    parent: projects/[PROJECT_ID]/locations/[ZONE/REGION]/clusters/$(ref.myclus.name)
    subnetwork: dev-web ### leave this field blank if using the default
    nodePool:
      name: my-clus-pool2
      initialNodeCount: 1
      version: "1.13"
      config:
        machineType: n1-standard-1
        imageType: cos
        oauthScopes:
        - https://www.googleapis.com/auth/cloud-platform
        preemptible: true
The expected outcome is 2 node pools in 2 different subnetworks.
I found out that this is actually not a limitation of Deployment Manager but a limitation of GKE.
We can’t assign a different subnet to different node pools, the network and subnets are defined at the cluster level. There is no “Subnetwork” field in the node pool API.
Here is a link you can refer to for more information.
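In practice this means dropping the subnetwork line from the node pool resource. Below is a sketch of the second resource from the question with that line removed; the pool then runs in the dev-web subnetwork set on the cluster. The bracketed placeholders are kept exactly as in the question, and nothing else has been changed.

```yaml
resources:
- name: my-clus
  type: gcp-types/container-v1:projects.locations.clusters.nodePools
  properties:
    parent: projects/[PROJECT_ID]/locations/[ZONE/REGION]/clusters/$(ref.myclus.name)
    # No subnetwork key here: the node pool inherits the network and
    # subnetwork configured on the cluster resource.
    nodePool:
      name: my-clus-pool2
      initialNodeCount: 1
      version: "1.13"
      config:
        machineType: n1-standard-1
        imageType: cos
        oauthScopes:
        - https://www.googleapis.com/auth/cloud-platform
        preemptible: true
```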
I created a Deployment Manager Template (python) to create a GKE Zonal cluster (v1beta1 feature). When I run gcloud deployment-manager deployments create <deploymentname> --config <config.yaml>, GKE cluster is created as expected.
I used type:gcp-types/container-v1beta1:projects.zones.clusters in my python template.
However, when I run the delete command on DM, i.e. gcloud deployment-manager deployments delete <deploymentname>, I get the error below.
The error says that the field name could not be found. However, I did specify name in my config.yaml file.
Error in Operation [operation-1536152440470-5751f5c88f9f3-5ca3a167-d12a593d]: errors:
- code: RESOURCE_ERROR
location: /deployments/test-project-gke-xhqgxn6pkd/resources/test-gkecluster-xhqgxn6pkd
message: "{"ResourceType":"gcp-types/container-v1beta1:projects.zones.clusters"
,"ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"message"
:"Invalid JSON payload received. Unknown name "name": Cannot bind query
parameter. Field 'name' could not be found in request message.","status"
:"INVALID_ARGUMENT","details":[{"#type":"type.googleapis.com/google.rpc.BadRequest"
,"fieldViolations":[{"description":"Invalid JSON payload received. Unknown
name "name": Cannot bind query parameter. Field 'name' could not be found
in request message."}]}],"statusMessage":"Bad Request","requestPath"
:"https://container.googleapis.com/v1beta1/projects/test-project/zones/us-east1-b/clusters/"
,"httpMethod":"GET"}}"
Here's the sample config.yaml
imports:
- path: templates/gke/gke.py
  name: gke.py
resources:
- name: ${CLUSTER_NAME}
  type: gke.py
  properties:
    zone: ${ZONE}
    cluster:
      name: ${CLUSTER_NAME}
      description: test gke cluster
      network: ${NETWORK_NAME}
      subnetwork: ${SUBNET_NAME}
      initialClusterVersion: ${CLUSTER_VERSION}
      nodePools:
      - name: ${NODEPOOL_NAME}
        initialNodeCount: ${NODE_COUNT}
        config:
          machineType: ${MACHINE_TYPE}
          diskSizeGb: 100
          imageType: cos
          oauthScopes:
          - https://www.googleapis.com/auth/compute
          - https://www.googleapis.com/auth/devstorage.read_only
          - https://www.googleapis.com/auth/logging.write
          - https://www.googleapis.com/auth/monitoring
          localSsdCount: ${LOCALSSD_COUNT}
Any ideas what I'm missing here?
I am trying to deploy a simple example. I didn't change anything except the warden.yml file. I want to deploy it to AWS and use an Elastic IP, so that I can access the server at a specific IP.
When I deploy it I receive:
Director task 67
Deprecation: Ignoring cloud config. Manifest contains 'networks' section.
Started preparing deployment > Preparing deployment. Done (00:00:00)
Started preparing package compilation > Finding packages to compile. Done (00:00:00)
Started creating missing vms > webapp/3a8acd3a-77a8-4bad-8de4-fb544d70f76d (0). Failed: Unknown CPI error 'InvalidCall' with message 'Arguments are not correct, details: 'expected string value for member 1 of key values of member 1 of option filters'' in 'create_vm' CPI method (00:00:05)
Error 100: Unknown CPI error 'InvalidCall' with message 'Arguments are not correct, details: 'expected string value for member 1 of key values of member 1 of option filters'' in 'create_vm' CPI method
What is the reason for this error?
warden.yml
name: webapp-warden
director_uuid: <%= `bosh status --uuid` %>
releases:
- name: simple-bosh-release
  version: latest
compilation:
  workers: 1
  network: webapp-network
  reuse_compilation_vms: true
  cloud_properties:
    instance_type: t2.medium
    availability_zone: us-west-2a
update:
  canaries: 1
  canary_watch_time: 30000-240000
  update_watch_time: 30000-600000
  max_in_flight: 3
resource_pools:
- name: common-resource-pool
  network: webapp-network
  size: 1
  stemcell:
    name: bosh-aws-xen-ubuntu-trusty-go_agent
    version: latest
  cloud_properties:
    instance_type: t2.medium
    availability_zone: us-west-2a
networks:
- name: webapp-network
  type: vip
  cloud_properties:
    security_groups:
    - default
  # cloud_properties:
  #   subnet: subnet-87d256ce
- name: default
  type: dynamic
  cloud_properties:
    security_groups:
    - default
jobs:
- name: webapp
  template: webapp
  instances: 1
  resource_pool: common-resource-pool
  networks:
  - name: webapp-network
    static_ips:
    - 52.40.58.163
  - name: default
    default: [dns, gateway]
  properties:
    webapp:
      admin: foo#bar.com
      servername: 52.40.58.163
This error occurs because a property is missing from the network configuration. Provide the subnet ID and try again; it should work.
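As a sketch of what that could look like, here is the dynamic network entry from warden.yml with a subnet added. Two things are assumptions here: that the subnet ID from the commented-out lines in the question (subnet-87d256ce) is the one you want, and that the subnet property belongs under the dynamic network's cloud_properties. Check both against your own VPC and the CPI documentation before relying on it.

```yaml
networks:
# The vip network (webapp-network) from the question is omitted here.
- name: default
  type: dynamic
  cloud_properties:
    # Assumed value, taken from the commented-out lines in warden.yml.
    subnet: subnet-87d256ce
    security_groups:
    - default
```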