GKE cluster fails to create instances - google-cloud-platform

I'm creating a GKE cluster from a script, and the cluster takes forever to create and just ends up failing after 35 minutes.
Command:
gcloud container clusters create 148374ed-92b0-4088-9623-c22c5aee3 \
--num-nodes 3 \
--enable-autorepair \
--cluster-version 1.11.2-gke.9 \
--scopes storage-ro \
--zone us-central1-a
The errors are not clear; it looks like some kind of buffer overflow internal to gcloud.
Deploy error: Not all instances running in IGM after 35m6.391174155s. Expect 3.
Current errors: [INTERNAL_ERROR]: Instance 'gke-148374ed-92b0-default-pool-66d3729f-6mw3' creation failed: Code: '-2097338327842179396' - ; Instance 'gke-148374ed-92b0-default-pool-66d3729f-qwpd' creation failed: Code: '-2097338327842179396' - ; .
Any ideas for debugging this?

I've been facing a similar issue while creating a cluster for the past 3 hours. A ticket has already been raised, and the GCP engineering team is working on a fix.
For status updates on the ticket, visit https://status.cloud.google.com/incident/compute/18012
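In general, when node creation fails like this, you can ask the underlying managed instance group (MIG) for its error details. A rough sketch, where $MIG_NAME is a placeholder for whatever managed instance group backs the node pool (exact behavior may vary by gcloud version):
# Find the managed instance group that backs the node pool
gcloud compute instance-groups managed list
# Ask that MIG for recent instance-creation errors
gcloud compute instance-groups managed list-errors $MIG_NAME \
--zone us-central1-a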

Related

Issues creating AlloyDB

Creating new AlloyDB instances has been failing for the past 24 hours. It was working fine a few days ago.
# creating the cluster works
gcloud beta alloydb clusters create dev-cluster \
--password=$PG_RAND_PW \
--network=$PRIVATE_NETWORK_NAME \
--region=us-east4 \
--project=${PROJECT_ID}
# creating primary instance fails
gcloud beta alloydb instances create devdb \
--instance-type=PRIMARY \
--cpu-count=2 \
--region=us-east4 \
--cluster=dev-cluster \
--project=${PROJECT_ID}
The error message is:
Operation ID: operation-1660168834702-5e5ea2da8dcd1-d96bdabb-4c686076
Creating instance...failed.
ERROR: (gcloud.beta.alloydb.instances.create) an internal error has occurred
Creating from the console also fails.
I have also tried from a completely new project and it still fails.
Any suggestions?
I've managed to replicate your issue, and it seems this is because AlloyDB for PostgreSQL is still in preview, so we may encounter some bugs and errors, according to this documentation:
This product is covered by the Pre-GA Offerings Terms of the Google Cloud Terms of Service. Pre-GA products might have limited support, and changes to pre-GA products might not be compatible with other pre-GA versions. For more information, see the launch stage descriptions.
What worked on my end was following the documentation on creating a cluster and its primary instance using the console. That step creates both the cluster and its primary instance at the same time. Please see my screenshot below for reference:
As you can see, the instance under the cluster my-cluster has an error and was not created; however, the instance devdb was created by following the link I provided above.
It would also be best to raise this as an issue, as per #DazWilkin's comment, if the problem persists in the future.
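If the problem comes back, the failed operation from the error output can also be inspected directly before filing an issue. A minimal sketch, assuming the operation ID and region shown in the question (flags may differ slightly by gcloud version):
# Describe the failed create operation, including any error details
gcloud beta alloydb operations describe \
operation-1660168834702-5e5ea2da8dcd1-d96bdabb-4c686076 \
--region=us-east4 \
--project=${PROJECT_ID}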

How Do I Grab the Number of Virtual Machines Running in a Node Pool using the CLI

Is it possible to find the number of running Compute Engine Virtual Machines (VM) that belong to a single Google Kubernetes Engine (GKE) node pool using the Google Cloud Platform (GCP) SDK (gcloud) instead of the console?
It is possible to find all Google Cloud Platform (GCP) Compute virtual machines (VM or nodes) that belong to a node pool with the GCP SDK:
gcloud compute instances list \
--filter="metadata.items[].filter(key:kube-labels).firstof(value):(cloud.google.com/gke-nodepool=${GCP_NODE_POOL_NAME})"
#=>
NAME ZONE MACHINE_TYPE PREEMPTIBLE INTERNAL_IP EXTERNAL_IP STATUS
gke-. . .-wwww us-central1-c g1-small 10.128.0.01 123.456.789.0 RUNNING
gke-. . .-xxxx us-central1-c g1-small 10.128.0.02 123.456.789.1 RUNNING
gke-. . .-yyyy us-central1-c g1-small 10.128.0.03 123.456.789.2 RUNNING
gke-. . .-zzzz us-central1-c g1-small 10.128.0.04 123.456.789.3 RUNNING
The --format flag might make counting the nodes in your node pool easier:
gcloud compute instances list \
--filter="metadata.items[].filter(key:kube-labels).firstof(value):(cloud.google.com/gke-nodepool=${GCP_NODE_POOL_NAME})" \
--format="value(status)"
#=>
RUNNING
RUNNING
RUNNING
RUNNING
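If all you need is the count itself, you can pipe the same query through wc; a quick sketch reusing the filter from above:
# Count the VMs that belong to the node pool
gcloud compute instances list \
--filter="metadata.items[].filter(key:kube-labels).firstof(value):(cloud.google.com/gke-nodepool=${GCP_NODE_POOL_NAME})" \
--format="value(name)" | wc -l
#=>
4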
Unfortunately, there currently does not seem to be a way to get a count of the currently running nodes via the gcloud container node-pools command group:
gcloud container node-pools describe \
$GCP_NODE_POOL_NAME \
--cluster=$GKE_CLUSTER_NAME \
--format=json \
--zone=$GCP_NODE_POOL_ZONE
#=>
{
  "autoscaling": {
    . . .
    "maxNodeCount": 5,
    "minNodeCount": 2
  },
  . . .
  "initialNodeCount": 3,
  . . .
}
There is currently an open feature request to add a currentNodeCount or a runningNodeCount to the above response here.
Assuming that you want a gcloud command to view the number of nodes running in a node pool of a GKE cluster:
The command for viewing the number of node pools in a GKE cluster is
gcloud container node-pools list --cluster CLUSTER_NAME
To view the number of nodes in a GKE cluster, you need to use kubectl:
kubectl get nodes
There is no command to view the nodes in a cluster using gcloud.
After giving it some time, I found out that there is no command to view the number of nodes in the node pools of a GKE cluster through gcloud, but it can be viewed using the command below:
kubectl get nodes -l cloud.google.com/gke-nodepool=$POOL_NAME -o=name --kubeconfig="$(pwd)/cluster" | wc -l

aws cli rds restore-db-cluster-from-snapshot giving internal error

I'm trying to restore a cluster snapshot as part of a CodeBuild pipeline that copies a prod DB into staging on a regular basis. I've run into an issue with the AWS CLI command to restore the cluster throwing the strange error below.
Here's the command I'm trying to run. It's taken from this blog post: https://aws.amazon.com/blogs/devops/enhancing-automated-database-continuous-integration-with-aws-codebuild-and-amazon-rds-database-snapshot/
aws rds restore-db-cluster-from-snapshot \
--snapshot-identifier arn:aws:rds:region-ID:account-ID:cluster-snapshot:db-snapshot-identifier \
--db-cluster-identifier myidentifiernameforrestore \
--engine aurora
I get the following error when executing the command:
An error occurred (InternalFailure) when calling the RestoreDBClusterFromSnapshot operation (reached max retries: 4): An internal error has occurred. Please try your query again at a later time.
Any ideas?
The issue is with the last part of the command:
--engine aurora
Change this to --engine aurora-mysql and it works and deploys the cluster.
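For completeness, the corrected command, identical to the one in the question apart from the engine:
aws rds restore-db-cluster-from-snapshot \
--snapshot-identifier arn:aws:rds:region-ID:account-ID:cluster-snapshot:db-snapshot-identifier \
--db-cluster-identifier myidentifiernameforrestore \
--engine aurora-mysql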
Thanks for the tip #Marcin

Can't add CloudRun as add-on when creating a GKE cluster

I'm trying out Cloud Run with GKE, was wondering about this error:
ERROR: (gcloud.beta.container.clusters.create) argument --addons:
CloudRun must be one of [HttpLoadBalancing, HorizontalPodAutoscaling,
KubernetesDashboard, Istio, NetworkPolicy]
It seems it won't let me use CloudRun as an add-on when I'm just creating a cluster.
The whole command is:
gcloud beta container clusters create testcloudrun \
--addons=HorizontalPodAutoscaling,HttpLoadBalancing,Istio,CloudRun \
--machine-type=n1-standard-4 \
--cluster-version=1.12.6-gke.16 --zone=us-central1-a \
--enable-stackdriver-kubernetes --enable-ip-alias \
--scopes cloud-platform
I'm just following the quick-start from the docs:
https://cloud.google.com/run/docs/quickstarts/prebuilt-deploy-gke
Update:
I've tried creating a cluster via Cloud Console, and I'm getting an error:
Horizontal pod autoscaling must be enabled in order to enable the Cloud Run addon.
Which is a known issue as well:
https://cloud.google.com/run/docs/issues
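One thing that may also be worth ruling out (an assumption on my part, not something confirmed in this thread) is an outdated Cloud SDK that simply doesn't recognize the CloudRun add-on yet; updating the SDK and its beta component before retrying is a cheap check:
# Update gcloud and make sure the beta component is installed, then retry
gcloud components update
gcloud components install beta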

Creating Bigtable replica cluster gives metric error

Playing with the newest Bigtable feature: cross-region replication.
I've created an instance and a replica cluster in a different region with this snippet:
gcloud bigtable instances create ${instance_id} \
--cluster=${cluster_id} \
--cluster-zone=${ZONE} \
--display-name=${cluster_id} \
--cluster-num-nodes=${BT_CLUSTER_NODES} \
--cluster-storage-type=${BT_CLUSTER_STORAGE} \
--instance-type=${BT_TYPE} \
--project=${PROJECT_ID}
gcloud beta bigtable clusters create ${cluster_id} \
--instance=${instance_id} \
--zone=${ZONE} \
--num-nodes=${BT_CLUSTER_NODES} \
--project=${PROJECT_ID}
The instance was created successfully, but creating the replica cluster gave me an error: ERROR: (gcloud.beta.bigtable.clusters.create) Metric 'bigtable.googleapis.com/ReplicationFromEUToNA' not defined in the service configuration.
However, the cluster was created and replication worked.
I know this is currently beta, but do I need to change my setup script, or is this something on the GCP side?
I can confirm that this is an issue on the GCP side. As you noted, it happens after replication is set up, so there should be no practical impact to you.
We have a ticket open to fix the underlying issue, which is purely around reporting the successful copy to our own internal monitoring. Thanks for the report!
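Despite the error message, one way to double-check that the replica cluster really exists is to list the clusters under the instance; a short sketch reusing the variables from the question:
# List all clusters (original and replica) that belong to the instance
gcloud bigtable clusters list --instances=${instance_id}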