Knative on GKE is not working with some images, shows RevisionMissing error

I am running Knative on a GKE cluster. The sample images from the Knative website work, but when I switch to other images it stops working: only 2 of the 3 containers run, the route's ready state remains 'unknown', and the Reason shows as 'RevisionMissing'.
I tried multiple images; k8s.gcr.io/hpa-example is one of them.
Edit: The cluster has two nodes of type n1-standard-4 (4 vCPUs, 15 GB memory). I created it from the GCP console with the latest version of Kubernetes and the Enable Istio checkbox checked. I used the following commands to install Knative:
kubectl apply --selector knative.dev/crd-install=true \
-f https://github.com/knative/serving/releases/download/v0.8.0/serving.yaml \
-f https://github.com/knative/eventing/releases/download/v0.8.0/release.yaml \
-f https://github.com/knative/serving/releases/download/v0.8.0/monitoring.yaml
kubectl apply \
-f https://github.com/knative/serving/releases/download/v0.8.0/serving.yaml \
-f https://github.com/knative/eventing/releases/download/v0.8.0/release.yaml \
-f https://github.com/knative/serving/releases/download/v0.8.0/monitoring.yaml
Thanks

OK, I found the problem. I tried deploying custom images. All of them worked until I changed the port (inside the image) to 80. That image failed not only as a Knative service but also on Cloud Run.
Bottom line: either read the port number from the environment variable, or hard-code it to any port other than 80.
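A minimal sketch of the first option, assuming a Python service. As far as I know, both Knative and Cloud Run pass the listening port to the container in the PORT environment variable. The fallback value 8080, the resolve_port helper, and the Hello handler are assumptions for illustration, not part of the original image.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

DEFAULT_PORT = 8080  # assumed fallback for local runs when PORT is unset


def resolve_port(environ=os.environ):
    """Return the port from the PORT variable, or the fallback if it is unset."""
    return int(environ.get("PORT", DEFAULT_PORT))


class Hello(BaseHTTPRequestHandler):
    """Hypothetical handler that answers every GET with a short text body."""

    def do_GET(self):
        body = b"hello\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Bind to all interfaces so the platform's proxy can reach the container.
    HTTPServer(("0.0.0.0", resolve_port()), Hello).serve_forever()
```

Reading the port at startup, rather than baking it into the image, lets the same image run unchanged on either platform.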

Thanks for the details.
When you installed Knative, you should have seen errors like these:
# Without CRD
unable to recognize "https://github.com/knative/serving/releases/download/v0.8.0/serving.yaml": no matches for kind "Gateway" in version "networking.istio.io/v1alpha3"
unable to recognize "https://github.com/knative/serving/releases/download/v0.8.0/serving.yaml": no matches for kind "Gateway" in version "networking.istio.io/v1alpha3"
unable to recognize "https://github.com/knative/serving/releases/download/v0.8.0/serving.yaml": no matches for kind "Image" in version "caching.internal.knative.dev/v1alpha1"
unable to recognize "https://github.com/knative/eventing/releases/download/v0.8.0/release.yaml": no matches for kind "ClusterChannelProvisioner" in version "eventing.knative.dev/v1alpha1"
# Without CRD
Error from server (NotFound): error when creating "https://github.com/knative/serving/releases/download/v0.8.0/monitoring.yaml": namespaces "istio-system" not found
Error from server (NotFound): error when creating "https://github.com/knative/serving/releases/download/v0.8.0/monitoring.yaml": namespaces "istio-system" not found
Error from server (NotFound): error when creating "https://github.com/knative/serving/releases/download/v0.8.0/monitoring.yaml": namespaces "istio-system" not found
Error from server (NotFound): error when creating "https://github.com/knative/serving/releases/download/v0.8.0/monitoring.yaml": namespaces "istio-system" not found
You had not installed Istio. Install it, then rerun the Knative installation (with and without CRD) to clear the errors above, and enjoy!

Related

My GKE pods stopped with error "no command specified: CreateContainerError"

Everything was OK and the nodes were fine for months, but suddenly some pods stopped with an error.
I tried deleting the pods and nodes, but the issue remains.
Try the possible solutions below to resolve your issue:
Solution 1 :
Check for a malformed character in your Dockerfile that could cause the container to crash.
When you encounter CreateContainerError, check that you have a valid ENTRYPOINT in the Dockerfile used to build your container image. If you don't have access to the Dockerfile, you can instead configure your pod object with a valid command in its command attribute.
So the workaround is to not specify any workerConfig explicitly, which makes the workers inherit all configs from the master.
Refer to Troubleshooting the container runtime and the similar questions SO1 and SO2; also check this similar GitHub link for more information.
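The command workaround from Solution 1 can be sketched as follows. This builds a Pod manifest with an explicit command and prints it as JSON, which kubectl also accepts. The pod name, image, and command are hypothetical placeholders, and pod_with_command is an illustrative helper, not part of any Kubernetes library.

```python
import json


def pod_with_command(name, image, command):
    """Build a Pod manifest whose single container has an explicit command."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {"name": name, "image": image, "command": list(command)}
            ]
        },
    }


if __name__ == "__main__":
    # Hypothetical name, image, and command; substitute your own values.
    manifest = pod_with_command("demo", "example.com/demo:1.0", ["/app/server"])
    print(json.dumps(manifest, indent=2))
```

The printed manifest can be piped into kubectl apply -f - to create the pod.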
Solution 2 :
The kubectl describe pod podname command provides detailed information about a pod. Use it to look for clues; if it reports Insufficient CPU, follow the solution below.
The solution is to either:
1) Upgrade the boot disk: if using a pd-standard disk, it's recommended to upgrade to pd-balanced or pd-ssd.
2) Increase the disk size.
3) Use a node pool whose machine type has more CPU cores.
See Adjust worker, scheduler, triggerer and web server scale and performance parameters for more information.
If you still have the issue, you can update the GKE version of your cluster to one of the fixed versions (see Manually upgrading the control plane).
Also check whether you have updated kubectl within the last year to use the new authentication plugin that comes with GKE v1.26.
Solution 3 :
If you have a pipeline on GitLab that deploys an image to a GKE cluster, check the version of the GitLab runner that handles your pipeline's jobs.
It turns out that every image built by a GitLab runner on an old version causes this issue at container start. Simply deactivate those runners, leave only runners on the latest version in the pool, and replay all pipelines.
Also check whether the GitLab CI script uses an old Docker image such as docker:19.03.5-dind; updating to docker:dind helps Kubernetes start the pod again.

Logstash Google Pubsub Input Plugin fails to load file and pull messages

I'm getting this error when trying to run a Logstash pipeline with a configuration that uses google_pubsub, in a Docker container running in my production env:
2021-09-16 19:13:25 FATAL runner:135 - The given configuration is invalid. Reason: Unable to configure plugins: (PluginLoadingError) Couldn't find any input plugin named 'google_pubsub'. Are you sure this is correct? Trying to load the google_pubsub input plugin resulted in this error: Problems loading the requested plugin named google_pubsub of type input. Error: RuntimeError
you might need to reinstall the gem which depends on the missing jar or in case there is Jars.lock then resolve the jars with `lock_jars` command
no such file to load -- com/google/cloud/google-cloud-pubsub/1.37.1/google-cloud-pubsub-1.37.1 (LoadError)
2021-09-16 19:13:25 ERROR Logstash:96 - java.lang.IllegalStateException: Logstash stopped processing because of an error: (SystemExit) exit
This seems to happen randomly when re-installing the plugin. I thought it was a proxy issue, but I have the Google domain on the whitelist. It might be the wrong one, or I might be missing something. Still, that doesn't explain the random failures.
Also, when I run the pipeline on my machine I get GCP events, but when I run it on a VM, no Pub/Sub messages are pulled. Could a firewall rule be blocking them?
The error message suggests a problem loading the 'google_pubsub' input plugin. This error generally occurs when the Pub/Sub input plugin is not installed properly. Make sure you are installing the Logstash plugin for Pub/Sub correctly.
For example, to install the Logstash plugin for Pub/Sub on a VM:
sudo -u root sudo -u logstash bin/logstash-plugin install logstash-input-google_pubsub
For a detailed demo refer to this community tutorial.
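After installing, one way to confirm the plugin is registered is to look for it in the output of bin/logstash-plugin list. The sketch below assumes it runs from the Logstash home directory; plugin_listed is a hypothetical helper. A listed plugin only shows that it is registered, so it may not rule out the missing-jar error from the question.

```python
import subprocess

PLUGIN = "logstash-input-google_pubsub"


def plugin_listed(list_output, plugin=PLUGIN):
    """Return True if `plugin` is the first token on any line of the listing."""
    names = (line.split()[0] for line in list_output.splitlines() if line.strip())
    return plugin in names


if __name__ == "__main__":
    # Assumes the working directory is the Logstash home.
    result = subprocess.run(
        ["bin/logstash-plugin", "list"],
        capture_output=True,
        text=True,
        check=True,
    )
    print("installed" if plugin_listed(result.stdout) else "missing")
```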

Filebeat and AWS Elasticsearch - Not Working

I have good experience with Elasticsearch: I have worked with version 2.4 and am now trying to learn the newer versions.
I am trying to set up Filebeat to send my Apache and system logs to my Elasticsearch endpoint. To save time, I launched a single-node t2.medium instance on AWS Elasticsearch Service with a public domain, and attached an access policy that allows everyone to access the cluster.
The AWS Elasticsearch instance is up, running, and healthy.
I launched an Ubuntu (18.04) server, downloaded the Filebeat tarball, and made the following configuration in filebeat.yml:
#-------------------------- Elasticsearch output ------------------------------
output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["https://my-public-test-domain.ap-southeast-1.es.amazonaws.com:443"]
  # Optional protocol and basic auth credentials.
  #protocol: "https"
  #username: "elastic"
  #password: "changeme"
I enabled the required modules:
filebeat modules enable system apache
Then, as per the Filebeat documentation, I changed the ownership of the Filebeat configuration file and started Filebeat with the following commands:
sudo chown root filebeat.yml
sudo ./filebeat -e
When I started Filebeat, I hit the following permission and ownership issue:
Error loading config from file '/home/ubuntu/beats/filebeat-7.2.0-linux-x86_64/modules.d/system.yml', error invalid config: config file ("/home/ubuntu/beats/filebeat-7.2.0-linux-x86_64/modules.d/system.yml") must be owned by the user identifier (uid=0) or root
To resolve this I changed the ownership for the files which were throwing errors.
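Rather than fixing files one error at a time, the offending files can be listed up front. This is a rough sketch under the assumption that Filebeat runs as root (uid 0), as in the error above; files_not_owned_by is a hypothetical helper, and the directory is the one from the error message.

```python
import os


def files_not_owned_by(root, uid=0):
    """Return the paths of files under `root` whose owner is not `uid`, sorted."""
    offenders = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.stat(path).st_uid != uid:
                offenders.append(path)
    return sorted(offenders)


if __name__ == "__main__":
    # Path taken from the error message above; adjust it to your install.
    for path in files_not_owned_by(
        "/home/ubuntu/beats/filebeat-7.2.0-linux-x86_64/modules.d"
    ):
        print(path)
```

Each printed path can then be passed to sudo chown root.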
When I restarted Filebeat, I started facing the following issue:
Connection marked as failed because the onConnect callback failed: cannot retrieve the elasticsearch license: unauthorized access, could not connect to the xpack endpoint, verify your credentials
Going through this link, I found that to work with AWS Elasticsearch I need the Beats OSS versions.
So I downloaded the OSS version of Filebeat from this link and followed the same procedure as above, but still no luck. Now I am facing the following errors:
Error 1:
Attempting to reconnect to backoff(elasticsearch(https://my-public-test-domain.ap-southeast-1.es.amazonaws.com:443)) with 12 reconnect attempt(s)
Error 2:
Failed to connect to backoff(elasticsearch(https://my-public-test-domain.ap-southeast-1.es.amazonaws.com:443)): Connection marked as failed because the onConnect callback failed: 1 error: Error loading pipeline for fileset system/auth: This module requires an Elasticsearch plugin that provides the geoip processor. Please visit the Elasticsearch documentation for instructions on how to install this plugin. Response body: {"error":{"root_cause":[{"type":"parse_exception","reason":"No processor type exists with name [geoip]","header":{"processor_type":"geoip"}}],"type":"parse_exception","reason":"No processor type exists with name [geoip]","header":{"processor_type":"geoip"}},"status":400}
From the second error I understand that the geoip plugin is not available, which is why I am facing this error.
What else needs to be done to get this working?
Has anyone been able to successfully connect Beats to AWS Elasticsearch?
What other steps could I take to mitigate the above issue?
Environment Details:
AWS Elasticsearch Version : 6.7
File Beat : 7.2.0
First, you need to use the OSS version of Filebeat with AWS ES: https://www.elastic.co/downloads/beats/filebeat-oss
Second, AWS Elasticsearch does not provide the GeoIP module, so you will need to edit the pipelines of any default modules you want to use and make sure GeoIP is removed or commented out.
For example, in /usr/share/filebeat/module/system/auth/ingest/pipeline.json (that's the path when installed from the deb package; your path may differ), comment out:
{
  "geoip": {
    "field": "source.ip",
    "target_field": "source.geo",
    "ignore_failure": true
  }
},
Repeat the same for the apache module.
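For pipelines that parse as plain JSON, the removal can be scripted. This is a sketch only: it assumes the file is valid JSON (some shipped pipeline files embed template directives and will not parse, in which case edit them by hand), and it only drops geoip entries at the top level of the processors list. The strip_geoip helper and the script name in the usage comment are hypothetical.

```python
import json


def strip_geoip(pipeline):
    """Return a copy of an ingest pipeline without its top-level geoip processors."""
    cleaned = dict(pipeline)
    cleaned["processors"] = [
        proc for proc in pipeline.get("processors", []) if "geoip" not in proc
    ]
    return cleaned


if __name__ == "__main__":
    import sys

    # Usage: python strip_geoip.py pipeline.json > pipeline.cleaned.json
    with open(sys.argv[1]) as handle:
        print(json.dumps(strip_geoip(json.load(handle)), indent=2))
```

Keep a backup of the original pipeline file before replacing it with the cleaned output.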
I spent hours trying to make the Filebeat iis module work with AWS Elasticsearch. I kept getting the ingest-geoip error; the following fixed the issue.
For Windows IIS logs with AWS Elasticsearch, remove geoip from the Filebeat module configuration in these files:
C:\Program Files (x86)\filebeat\module\iis\access\ingest\default.json
C:\Program Files (x86)\filebeat\module\iis\access\manifest.yml
C:\Program Files (x86)\filebeat\module\iis\error\ingest\default.json
C:\Program Files (x86)\filebeat\module\iis\error\manifest.yml

GCP: kubectl exec/logs fails to container on using UBUNTU as OS

I created a 2-node cluster with UBUNTU as the OS.
After deploying a container, kubectl exec or kubectl logs fails with the following error:
Error from server: error dialing backend: No SSH tunnels currently open. Were the targets able to accept an ssh-key for user <username>
Please tell me how to make it work.
The nodes are part of the default pool only.
Steps to reproduce:
gcloud container clusters create "gke-test-cluster" --image-type=UBUNTU --machine-type=n1-standard-2 --zone us-east1-c --num-nodes 2 --cluster-version=1.8
kubectl create -f https://k8s.io/docs/tasks/debug-application-cluster/shell-demo.yaml
kubectl get pod shell-demo
kubectl exec -it shell-demo -- /bin/bash
Error from server: error dialing backend: No SSH tunnels currently open. Were the targets able to accept an ssh-key for user "gke-0c"?
kubectl logs shell-demo
Error from server: Get https://10.142.0.5:10250/containerLogs/default/shell-demo/nginx: No SSH tunnels currently open. Were the targets able to accept an ssh-key for user "gke-0c"?
I am using my laptop for all CLI commands.
This issue has already been raised at:
https://issuetracker.google.com/issues/77986235
https://serverfault.com/questions/907468/gcp-kubectl-exec-logs-fails-to-container-on-using-ubuntu-as-os/907882?noredirect=1#comment1177112_907882
I tried to reproduce your issue with your exact commands, and it worked just fine. This has to be caused by something else (like the firewall, as suggested in the issue tracker).
Check that you have these three firewall rules:
gke-gke-test-cluster-07424324-all ...
gke-gke-test-cluster-07424324-ssh ...
gke-gke-test-cluster-07424324-vms ...
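The check for these three rules can be scripted. The sketch below assumes the rule names follow the gke-<cluster>-<id>-<suffix> pattern shown above and takes the names from gcloud compute firewall-rules list; missing_gke_rules is a hypothetical helper.

```python
import subprocess

EXPECTED_SUFFIXES = ("all", "ssh", "vms")


def missing_gke_rules(rule_names, cluster):
    """Return the expected suffixes that have no gke-<cluster>-<id>-<suffix> rule."""
    prefix = f"gke-{cluster}-"
    found = {
        name.rsplit("-", 1)[1]
        for name in rule_names
        if name.startswith(prefix) and "-" in name[len(prefix):]
    }
    return [suffix for suffix in EXPECTED_SUFFIXES if suffix not in found]


if __name__ == "__main__":
    # Lists rule names in the currently configured gcloud project.
    result = subprocess.run(
        ["gcloud", "compute", "firewall-rules", "list", "--format=value(name)"],
        capture_output=True,
        text=True,
        check=True,
    )
    missing = missing_gke_rules(result.stdout.split(), "gke-test-cluster")
    print("missing rules:", ", ".join(missing) if missing else "none")
```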
As for Cloud Shell versus your laptop, there is not much difference if you are correctly authenticated with the Cloud SDK. So saying "This issue is also reproducible from gcp cloud-shell" doesn't really make sense.
If you do have the firewall rules and have not done much in the project yet, I would recommend creating a new project and starting over there.
It was an issue with the size of the project metadata. We cleaned it up and it worked.

How can I get cadvisor (Docker) working with AWS/Debian?

I have an AWS instance set up (Debian) onto which I've installed Docker.
I can successfully run the hello-world container, as well as running ubuntu as recommended in the Docker install validation.
I want to run cadvisor. So I ran the recommended quick-start script:
sudo docker run \
--volume=/:/rootfs:ro \
--volume=/var/run:/var/run:rw \
--volume=/sys:/sys:ro \
--volume=/var/lib/docker/:/var/lib/docker:ro \
--publish=8080:8080 \
--detach=true \
--name=cadvisor \
google/cadvisor:latest
That gave me no error but when I do a 'sudo docker ps' nothing's there; like it fired up and died or otherwise shut itself down.
I tried adding "--logtostderr" to the end to see what I could see--and saw:
I0108 19:19:55.308016 00001 storagedriver.go:89] Caching 60 recent stats in memory; using "" storage driver
I0108 19:19:55.308353 00001 manager.go:78] cAdvisor running in container: "/docker/e3b5ede6f6def6b36d7682814aefc2b414defaea065ccf977a1a2542a80c310c"
F0108 19:19:55.337891 00001 cadvisor.go:76] Failed to create a Container Manager: failed to get cache information for node 0: open /sys/devices/system/cpu/cpu1/cache: no such file or directory
Do I need to do something different for a Debian system?
Notice that the docker command explicitly mounts the /sys directory from the host system (--volume=/sys:/sys:ro), and the error complains about a path under it: /sys/devices/system/cpu/cpu1/cache. If that file or folder does not exist on your host VM, it won't exist inside the container either.
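To check this on the host before starting the container, something like the following sketch can list the CPUs that have no cache entry. It assumes the standard Linux sysfs layout under /sys/devices/system/cpu; cpus_missing_cache is a hypothetical helper.

```python
import os
import re


def cpus_missing_cache(cpu_root="/sys/devices/system/cpu"):
    """Return the cpuN directory names under `cpu_root` that lack a `cache` entry."""
    missing = []
    for entry in sorted(os.listdir(cpu_root)):
        # Only cpu0, cpu1, ... are per-CPU directories; skip cpufreq, cpuidle, etc.
        if re.fullmatch(r"cpu\d+", entry):
            if not os.path.exists(os.path.join(cpu_root, entry, "cache")):
                missing.append(entry)
    return missing


if __name__ == "__main__":
    missing = cpus_missing_cache()
    print("CPUs without cache info:", ", ".join(missing) if missing else "none")
```

Any CPU listed here would be expected to trigger the same "no such file or directory" failure in the affected cAdvisor version.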
I have tested both the Ubuntu and the standard Amazon AMIs, and both have the path mentioned. I don't see Debian among the standard AMIs, so I have no easy way to test it, but I suspect the image you are using is missing the required kernel modules or settings. Why not use one of the standard Amazon AMIs?
There was a bug in cAdvisor, which we have fixed. The latest version of cAdvisor should work just fine with Debian on AWS or anywhere else.