kubeflow deploy gcp endpoints controller fails - google-cloud-platform

I am trying to deploy kubeflow on gcp using official guidelines https://www.kubeflow.org/docs/distributions/gke/deploy/deploy-cli/
I tried three times but it seems there is a problem with endpoints controller. When checked by: kubectl -n kubeflow get all
All pods are running except the
NAME READY STATUS RESTARTS AGE
pod/admission-webhook-deployment-667bd68d94 1/1 Running
pod/cache-deployer-deployment-75ccdc98b4 2/2 Running
pod/cache-server-56f78bf64b 2/2 Running
pod/centraldashboard-5fdbd9b744 1/1 Running
pod/cloud-endpoints-controller-5f7dbc6fc8 0/1 ImagePullBackOff
Pod desciption says that it failed to resolve reference "gcr.io/cloud-solutions-group/cloud-endpoints-controller:0.2.1": unexpected status code [manifests 0.2.1]: 403 Forbidden
I am new to kubeflow but despite retrying this three times it always results in the same issue.

You can clone the repo and build the image yourself and push it to your container registry.
This is one workaround to fix this until the official image is back.
git clone https://github.com/jlewi/cloud-endpoints-controller.git
cd cloud-endpoints-controller
git checkout 0.2.1
docker build . -t <YOUR DOCKER REGISTRY>/cloud-endpoints-controller:0.2.1
docker push <YOUR DOCKER REGISTRY>/cloud-endpoints-controller:0.2.1
And this use the new image in your pod spec.

Urgent release is made: https://github.com/kubeflow/gcp-blueprints/releases/tag/v1.4.1, you can now use v1.4.1 tag for deployment.
---- original -----
Thank you for posting this issue! I have posted a mitigation solution here in https://github.com/kubeflow/gcp-blueprints/issues/343#issuecomment-1028488756. I am planning to fix this issue in the coming release.

Related

My GKE pods stoped with error "no command specified: CreateContainerError"

Everything was Ok and nodes were fine for months, but suddenly some pods stopped with an error
I tried to delete pods and nodes but same issues.
Try below possible solutions to resolve your issue:
Solution 1 :
Check a malformed character in your Dockerfile and cause it to crash.
When you encounter CreateContainerError is to check that you have a valid ENTRYPOINT in the Dockerfile used to build your container image. However, if you don’t have access to the Dockerfile, you can configure your pod object by using a valid command in the command attribute of the object.
So workaround is to not specify any workerConfig explicitly which makes the workers inherit all configs from the master.
Refer to Troubleshooting the container runtime, similar SO1, SO2 & Also check this similar github link for more information.
Solution 2 :
Kubectl describe pod podname command provides detailed information about each of the pods that provide Kubernetes infrastructure. With the help of this you can check for clues, if Insufficient CPU follows the solution below.
The solution is to either:
1)Upgrade the boot disk: If using a pd-standard disk, it's recommended to upgrade to pd-balanced or pd-ssd.
2)Increase the disk size.
3)Use node pool with machine type with more CPU cores.
See Adjust worker, scheduler, triggerer and web server scale and performance parameters for more information.
If you still have the issue, you can then update the GKE version for your cluster Manually upgrading the control planeto one of the fixed versions.
Also check whether you have updated it in the last year to use the new kubectl authentication coming in the GKE v1.26 plugin?
Solution 3 :
If you're having a pipeline on GitLab that deploys an image to a GKE cluster: Check the version of the Gitlab runner that handles the jobs of your pipeline .
Because it turns out that every image built through a Gitlab runner running on an old version causes this issue at the container start. Simply deactivate them and only let Gitlab runners running last version in the pool, replay all pipelines.
Check the gitlab CI script using an old docker image like docker:19.03.5-dind, update to docker:dind helps the kubernetes to start the pod again.

Kubectl against GKE Cluster through Terraform's local-exec?

I am trying to make an automatic migration of workloads between two node pools in a GKE cluster. I am running Terraform in GitLab pipeline. When new node pool is created the local-exec runs and I want to cordon and drain the old node so that the pods are rescheduled on the new one. I am using this registry.gitlab.com/gitlab-org/terraform-images/releases/1.1:v0.43.0 image for my Gitlab jobs. Also, python3 is installed with apk add as well as gcloud cli - downloading the tar and using the gcloud binary executable from google-cloud-sdk/bin directory.
I am able to use commands like ./google-cloud-sdk/bin/gcloud auth activate-service-account --key-file=<key here>.
The problem is that I am not able to use kubectl against my cluster.
Although I have installed the gke-gcloud-auth-plugin with ./google-cloud-sdk/bin/gcloud components install gke-gcloud-auth-plugin --quiet once in the CI job and second time in the local-exec script in HCL code I get the following errors:
module.create_gke_app_cluster.null_resource.node_pool_provisioner (local-exec): E0112 16:52:04.854219 259 memcache.go:238] couldn't get current server API group list: Get "https://<IP>/api?timeout=32s": getting credentials: exec: executable <hidden>/google-cloud-sdk/bin/gke-gcloud-auth-plugin failed with exit code 1
290module.create_gke_app_cluster.null_resource.node_pool_provisioner (local-exec): Unable to connect to the server: getting credentials: exec: executable <hidden>/google-cloud-sdk/bin/gke-gcloud-auth-plugin failed with exit code 1
When I check the version of the gke-gcloud-auth-plugin with gke-gcloud-auth-plugin --version
I am getting the following error:
174/bin/sh: eval: line 253: gke-gcloud-auth-plugin: not found
Which clearly means that the plugin is not installed.
The image that I am using is based on alpine for which there is no way to install the plugin via package manager, unfortunately.
Edit: gcloud components list shows gke-gcloud-auth-plugin as installed too.
The solution was to use google/cloud-sdk image in which I have installed terraform and used this image for the job in question.

Why GCloud Builds submit failing after creating image?

I am learning deploying a pubsub service to run under Cloud Run, by following the guidelines given here
Steps I followed are:
Created a new project folder "myProject" in my local machine
Added below files:
app.jsindex.jsDockerfile
Executed below command to ship the code
gcloud builds submit --tag gcr.io/Project-ID/pubsub
It's mentioned in the tutorial document that
Upon success, you should see a SUCCESS message containing the ID, creation time, and image name. The image is stored in Container Registry and can be re-used if desired.
But in my case it's returning with error: (Ref: screenshot)
I have verified the build logs, "It's success"
So I thought to ignore this error and proceed with the next step to deploy the app by running the command:
gcloud run deploy sks-pubsub-cloudrun --image gcr.io/Project-ID/pubsub --no-allow-unauthenticated
When I run this command it immediately asking to specify the region (26 is my choice) from the list.
Next it fails with error:
Deploying container to Cloud Run service [sks-pubsub-cloudrun] in project [Project-ID] region [us-central1]
Deploying new service... Cloud Run error: The user-provided container failed to start and listen on the port defined provided by the PORT=8080 environment variable.
Logs for this revision might contain more information.
As I am new to this GCP & Dockerizing services, not understanding this issue and unable to fix it. I researched many blogs and articles yet no proper solution for this error.
Any help will be appreciated.
Tried to run the container locally and it's failing with error.
I'm using VS Code IDE, and "Cloud Code: Debug on Cloud Run Emulator" to debug the code.
Starting to debug the app using configuration `Cloud Run: Run/Debug Locally` from .vscode/launch.json
To view more detailed logs, go to Output channel : "Cloud Run: Run/Debug Locally - Detailed"
Dependency check started
Dependency check succeeded
Unpausing minikube
The minikube profile 'cloud-run-dev-internal' has been scheduled to stop automatically after exiting Cloud Code. To disable this on future deployments, set autoStop to false in your launch configuration d:\POC\promo_run_pubsub\.vscode\launch.json
Configuring minikube gcp-auth addon
Using GCP project 'Project-Id' with minikube gcp-auth
Failed to configure minikube gcp-auth addon. Your app might not be able to authenticate Google or GCP APIs it calls. The addon has been disabled. More details can be found in the detailed logs.
Update initiated
Deploy started
Deploy completed
Status check started
Resource pod/promo-run-pubsub-5d4cd64bf9-8pf4q status updated to In Progress
Resource deployment/promo-run-pubsub status updated to In Progress
Resource pod/promo-run-pubsub-5d4cd64bf9-8pf4q status updated to In Progress
Resource deployment/promo-run-pubsub status failed with waiting for rollout to finish: 0 of 1 updated replicas are available...
Status check failed
Update failed with error code STATUSCHECK_CONTAINER_TERMINATED
1/1 deployment(s) failed
Skaffold exited with code 1.
Cleaning up...
Finished clean up.

Cloud Run /Cloud Code deployment error in intellij

trying to follow the Getting Started instructions for Deploying a Cloud Run service with Cloud Code in Intellij (deploying HelloWorld Flask app container with Cloud Run: Deploy) but getting the following error, any idea why this might be happening
it worked initially i.e. deployed the app on Cloud Run service using the same steps, and then started throwing this error after a week or so when trying to redeploy, there was no change in project settings.
intellij and docker versions are the latest.
authenticated to google cloud project with gcloud auth login --update-adc
The local run works fine (Cloud Run: Run Locally),
but running the Cloud Run: Deploy throws this "code 89" error
Preparing Google Cloud SDK (this may take several minutes for first time setup)...
Creating skaffold file: /var/.../skaffold8013155926954225609.tmp
Configuring image push settings in /var/.../skaffold8013155926954225609.tmp
../Library/Application Support/cloud-code/bin/versions/../
skaffold build --filename /var/.../skaffold8013155926954225609.tmp --tag latest --skip-tests=true
invalid skaffold config: getting minikube env:
running [/Users/USER/Library/Application Support/google-cloud-tools-java/managed-cloud-sdk/LATEST/google-cloud-sdk/bin/
minikube docker-env --shell none -p minikube --user=skaffold]
- stdout: "false exit code 89"
- stderr: ""
- cause: exit status 89
Failed to build and push Cloud Run container image.
Please ensure your builder settings are correct, network is available, you are logged in to a valid GCP project, and try again.
Edit: I see minikube error code 89: ExGuestUnavailable and it's an error code specific to the guest host, still unclear what might be causing this
Looks like an issue with skaffold attempting to communicate with minikube (which could be used for building images as well). Please try cleaning minikube
minikube stop
minikube delete --all --purge
and try again.
ok, i still don't know why it fails to deploy to cloud run from intellij but i got it to deploy from command line
cd my-flask-app
#step 1: build container image from Dockerfile and submit to container registry
gcloud builds submit --tag gcr.io/GCP_PROJECT_ID/my-flask-app
#step 2: deploy the image on cloud run (reference)
gcloud run deploy --image gcr.io/GCP_PROJECT_ID/my-flask-app
references:
https://cloud.google.com/build/docs/building/build-containers
https://cloud.google.com/container-registry/docs/quickstart
Edit: the answer above did the trick : minikube delete --all --purge

Cloud Composer GKE Node upgrade results in Airflow task randomly failing

The problem:
I have a managed Cloud composer environment, under a 1.9.7-gke.6 Kubernetes cluster master.
I tried to upgrade it (as well as the default-pool nodes) to 1.10.7-gke.1, since an upgrade was available.
Since then, Airflow has been acting randomly. Tasks that were working properly are failing for no given reason. This makes Airflow unusable, since the scheduling becomes unreliable.
Here is an example of a task that runs every 15 minutes and for which the behavior is very visible right after the upgrade:
airflow_tree_view
On hover on a failing task, it only shows an Operator: null message (null_operator). Also, there is no log at all for that task.
I have been able to reproduce the situation with another Composer environment in order to ensure that the upgrade is the cause of the dysfunction.
What I have tried so far :
I assumed the upgrade might have screwed up either the scheduler or Celery (Cloud composer defaults to CeleryExecutor).
I tried restarting the scheduler with the following command:
kubectl get deployment airflow-scheduler -o yaml | kubectl replace --force -f -
I also tried to restart Celery from inside the workers, with
kubectl exec -it airflow-worker-799dc94759-7vck4 -- sudo celery multi restart 1
Celery restarts, but it doesn't fix the issue.
So I tried to restart the airflow completely the same way I did with airflow-scheduler.
None of these fixed the issue.
Side note, I can't access Flower to monitor Celery when following this tutorial (Google Cloud - Connecting to Flower). Connecting to localhost:5555 stay in 'waiting' state forever. I don't know if it is related.
Let me know if I'm missing something!
1.10.7-gke.2 is available now [1]. Can you further upgrade to 1.10.7-gke.2 to see if the issue persists?
[1] https://cloud.google.com/kubernetes-engine/release-notes