My GKE pods stopped with error "no command specified: CreateContainerError" - google-cloud-platform

Everything was OK and the nodes were fine for months, but suddenly some pods stopped with an error.
I tried deleting the pods and nodes, but the same issue persists.

Try the possible solutions below to resolve your issue:
Solution 1:
Check for a malformed character in your Dockerfile, which can cause the container to crash.
The first thing to do when you encounter CreateContainerError is to check that you have a valid ENTRYPOINT in the Dockerfile used to build your container image. However, if you don’t have access to the Dockerfile, you can configure your pod object by using a valid command in the command attribute of the object.
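As a rough sketch (the pod name, image, and start command below are placeholders, not taken from your cluster), supplying a command in the pod spec to stand in for a missing ENTRYPOINT could look like this:
apiVersion: v1
kind: Pod
metadata:
  name: my-app                                # hypothetical pod name
spec:
  containers:
    - name: my-app
      image: gcr.io/my-project/my-app:latest  # placeholder image
      command: ["/bin/sh", "-c"]              # stands in for the missing ENTRYPOINT
      args: ["exec /app/server"]              # placeholder start command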
Another workaround is to not specify any workerConfig explicitly, which makes the workers inherit all configs from the master.
Refer to Troubleshooting the container runtime, the similar SO1 and SO2 threads, and this similar GitHub link for more information.
Solution 2:
The kubectl describe pod <podname> command provides detailed information about each of the pods that make up your Kubernetes infrastructure. With its help you can check for clues; if you see Insufficient CPU, follow the solution below.
The solution is to either:
1) Upgrade the boot disk: if using a pd-standard disk, it's recommended to upgrade to pd-balanced or pd-ssd.
2) Increase the disk size.
3) Use a node pool with a machine type that has more CPU cores (see the example gcloud command below).
See Adjust worker, scheduler, triggerer and web server scale and performance parameters for more information.
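For point 3, a node pool with more CPU cores and a faster boot disk can be created roughly like this (pool name, cluster name, machine type, and sizes are placeholders, not recommendations for your workload):
# Hypothetical example; adjust names and sizes to your environment
gcloud container node-pools create larger-pool \
    --cluster=my-cluster \
    --machine-type=e2-standard-4 \
    --disk-type=pd-balanced \
    --disk-size=200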
If you still have the issue, you can then update the GKE version for your cluster (see Manually upgrading the control plane) to one of the fixed versions.
Also check whether you have updated your tooling in the last year to use the new kubectl authentication plugin coming in GKE v1.26.
Solution 3:
If you have a pipeline on GitLab that deploys an image to a GKE cluster, check the version of the GitLab runner that handles the jobs of your pipeline.
It turns out that every image built through a GitLab runner running an old version causes this issue at container start. Simply deactivate those runners, keep only GitLab runners running the latest version in the pool, and replay all pipelines.
Also check whether the GitLab CI script uses an old Docker image such as docker:19.03.5-dind; updating to docker:dind helps Kubernetes start the pod again.
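A hedged sketch of that last point, assuming a typical build job (the job name is made up and the variables are standard GitLab CI predefined variables, not taken from your pipeline):
build-image:
  image: docker:dind          # updated from an old image such as docker:19.03.5-dind
  services:
    - docker:dind
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA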

Related

AWS cloudformation: How to run cfn-nag locally in Windows

I have a CloudFormation template with all the resources and details for the project.
I have cfn-lint set up locally and it is running perfectly fine. However, when I push code changes, the build fails at the deployment stage because cfn-nag flags some simple issues which could be fixed.
I'm using a Windows machine and I need a way to run cfn-nag locally, so that I can check it just like cfn-lint and fix the findings locally instead of waiting 40 minutes for the build to reach the deployment stage.
I referred to several posts online and found the two below helpful:
https://stelligent.com/2018/03/23/validating-aws-cloudformation-templates-with-cfn_nag-and-mu/
https://github.com/stelligent/cfn_nag
What is the difference between cfn-nag and cfn-lint, and why is lint not failing on what cfn-nag is complaining about?
The above links have some instructions for Ruby and Brew, but I'm using Node.js and felt lost. Please help.
CFN-Nag looks for patterns in AWS CloudFormation templates that may indicate insecure infrastructure,
Ex:
IAM rules that are too permissive (wildcards),
Security group rules that are too permissive (wildcards),
Access logs that aren’t enabled,
Encryption that isn’t enabled.
CFN-Lint scans the AWS CloudFormation template by processing a collection of rules, where every rule handles a specific function check or validation of the template. It validates against the AWS CloudFormation resource specification.
This collection of rules can be extended with custom rules using the --append-rules argument.
Ex: whitespace, alignment (YAML), type checks, valid values for resource properties, and other best practices.
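For reference, a typical local cfn-lint invocation looks roughly like this (the template file and custom rules directory are placeholders):
pip install cfn-lint                                   # cfn-lint is a Python CLI
cfn-lint my-template.yaml                              # lint a single template
cfn-lint my-template.yaml --append-rules ./my_rules    # extend with custom rules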
The two links you provided above have all the information needed, just not directly for a Node.js developer using a Windows machine.
Step 1: Pull the docker image stelligent/cfn-nag
Step 2: Add a script for cfn-nag to your package.json
Ex:
"scripts" : {
"cfn:nag": "cfn-nag"
}
If you're using docker-compose.yml, add the cfn-nag image details to your docker-compose.yml like below:
cfn-nag:
  image: "stelligent/cfn-nag"
  volumes:
    - ./path_of_cfn_file_to_copy:/path_to_copy_to
  command: ${COMMAND:-/path_to_copy_to/cfn_file}
Then just set the script in package.json to run via docker-compose:
"cfn:nag": "docker-compose run --rm cfn-nag"

JupyterHub notebook server returning 500 error, pod stuck in "terminating" state

I have an AWS EKS cluster (Kubernetes version 1.14) which runs a JupyterHub application.
One of the users' notebook servers is returning a 500 error:
500 : Internal Server Error
Redirect loop detected. Notebook has JupyterHub version unknown (likely < 0.8), but the hub expects 0.9.6. Try installing JupyterHub==0.9.6 in the user environment if you continue to have problems.
You can try restarting your server from the homepage.
Only one user is experiencing this issue; others are not. When I run "kubectl get pod", this user's pod shows that it is in the "terminating" state (it appears to be stuck in this state).
I was able to fix it, but I can't say this is the right approach. (I would have preferred to diagnose the root cause)
First, I tried deleting the pod with kubectl delete pod <pod_name> -- it did not work.
Second, I tried force-deleting the pod with kubectl delete pod <pod_name> --grace-period=0 --force -- it worked, but it turns out this only deletes the API object; the pod's resources are then orphaned on the cluster.
I checked the node status with kubectl get node and noticed one node was stuck in the NotReady state. I recycled this node -- it still did not work; the user's notebook server was still stuck and returning the 500 error.
Finally, I simply deleted the user's notebook server from the JupyterHub admin page. This fixed it.
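(Not part of the steps above, but for diagnosing this in the future: a pod stuck in Terminating is often being held by a finalizer, which you can inspect with something like the following; pod and namespace names are placeholders.)
kubectl get pod <pod_name> -n <namespace> -o jsonpath='{.metadata.finalizers}'
kubectl describe pod <pod_name> -n <namespace>    # the Events section may hint at what is blocking deletion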

Elastic Kubernetes Service AWS deployment process to avoid downtime

It's been a month since I started working on AWS EKS, and up till now I have successfully deployed my code.
The steps I follow for deployment are given below:
Create the image from the Docker terminal.
Tag and push it to AWS ECR.
Create the deployment file "project.json" and the service file "project-svc.json".
Save the above files in the "kubectl/bin" path and deploy them with the following commands:
"kubectl apply -f projectname.json" and "kubectl apply -f projectname-svc.json".
So if I want to deploy the same project again with a change, I push the new image to ECR, delete the existing deployment using "kubectl delete -f projectname.json" (without deleting the existing service), and then deploy it again using "kubectl apply -f projectname.json".
Now, my confusion is that after I delete the existing deployment there is downtime until I apply or create the deployment again. How do I avoid this? I don't want downtime; that is actually the reason I started to use EKS.
One more thing: the deployment process is a bit long, too. I know I'm missing something; can anybody guide me properly, please?
The project is on .NET Core, and if there is any simplified way to do the deployment using Visual Studio, please guide me on that as well.
Thank You in advance!
There is actually no need to delete your deployment. You just need to update the desired state (the deployment configuration) and let K8s do its magic and apply the needed changes, such as deploying a new version of your container.
If you have a single instance of your container, you will experience a short downtime while changes are applied. If your application supports multiple replicas (HA), you can enjoy the rolling upgrade feature.
Start by reading the official Kubernetes documentation on Performing a Rolling Update.
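As a sketch of what that looks like in a manifest (the names, labels, and ECR image below are placeholders standing in for your project.json contents), a Deployment set up for zero-downtime rolling updates might be:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: projectname
spec:
  replicas: 3                      # multiple replicas keep serving traffic during the rollout
  selector:
    matchLabels:
      app: projectname
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0            # never take an old pod down before its replacement is ready
      maxSurge: 1                  # allow at most one extra pod during the rollout
  template:
    metadata:
      labels:
        app: projectname
    spec:
      containers:
        - name: projectname
          image: ACCOUNT_ID.dkr.ecr.REGION.amazonaws.com/projectname:TAG   # placeholder ECR image
With this in place, pushing a new tag and running kubectl apply -f (or kubectl set image, as described in the next answer) replaces pods gradually instead of requiring a delete.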
You only need to use delete/apply if you are changing the ConfigMap attached to the Deployment (and only if you have one).
If the only change you make is the "image" of the deployment, you should use the "set image" command.
kubectl lets you change the actual deployment image and performs the rolling update all by itself; with 3+ pods you have the minimum chance of downtime.
Even better, if you use the --record flag, you can roll back to your previous image with no effort because it keeps track of the changes.
You can also specify the context, with no need to jump between contexts.
You can run it like this:
kubectl set image deployment DEPLOYMENT_NAME DEPLOYMENT_NAME=IMAGE_NAME --record -n NAMESPACE
Or, specifying the cluster:
kubectl set image deployment DEPLOYMENT_NAME DEPLOYMENT_NAME=IMAGE_NAME_ECR -n NAMESPACE --cluster EKS_CLUSTER_NPROD --user EKS_CLUSTER --record
For example:
kubectl set image deployment nginx-dep nginx-dep=ecr12345/nginx:latest -n nginx --cluster eu-central-123-prod --user eu-central-123-prod --record
The --record flag is what lets you track all the changes; if you want to roll back, just do:
kubectl rollout undo deployment.v1.apps/nginx-dep
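For completeness, the recorded revisions can be listed and a specific one restored (reusing the nginx-dep example above; the revision number is arbitrary):
kubectl rollout history deployment/nginx-dep -n nginx                 # list recorded revisions
kubectl rollout undo deployment/nginx-dep -n nginx --to-revision=2    # roll back to an example revision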
More documentation about it here:
Updating a deployment
https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#updating-a-deployment
Roll Back Deployment
https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#rolling-back-a-deployment

Cloud Composer GKE Node upgrade results in Airflow task randomly failing

The problem:
I have a managed Cloud composer environment, under a 1.9.7-gke.6 Kubernetes cluster master.
I tried to upgrade it (as well as the default-pool nodes) to 1.10.7-gke.1, since an upgrade was available.
Since then, Airflow has been acting randomly. Tasks that were working properly are failing for no given reason. This makes Airflow unusable, since the scheduling becomes unreliable.
Here is an example of a task that runs every 15 minutes and for which the behavior is very visible right after the upgrade:
(screenshot: airflow_tree_view)
Hovering over a failing task only shows an Operator: null message (null_operator). Also, there is no log at all for that task.
I have been able to reproduce the situation with another Composer environment in order to ensure that the upgrade is the cause of the dysfunction.
What I have tried so far:
I assumed the upgrade might have screwed up either the scheduler or Celery (Cloud Composer defaults to CeleryExecutor).
I tried restarting the scheduler with the following command:
kubectl get deployment airflow-scheduler -o yaml | kubectl replace --force -f -
I also tried to restart Celery from inside the workers, with
kubectl exec -it airflow-worker-799dc94759-7vck4 -- sudo celery multi restart 1
Celery restarts, but it doesn't fix the issue.
So I tried to restart Airflow completely, the same way I did with airflow-scheduler.
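For the worker deployment (its name is inferred from the pod name above), that replace pattern would look something like:
kubectl get deployment airflow-worker -o yaml | kubectl replace --force -f -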
None of these fixed the issue.
As a side note, I can't access Flower to monitor Celery when following this tutorial (Google Cloud - Connecting to Flower). Connecting to localhost:5555 stays in a 'waiting' state forever. I don't know if it is related.
Let me know if I'm missing something!
1.10.7-gke.2 is available now [1]. Can you further upgrade to 1.10.7-gke.2 to see if the issue persists?
[1] https://cloud.google.com/kubernetes-engine/release-notes

Codedeploy-agent keeps more revisions than the number specified in config

On my AWS instance, I have set max_revisions in /etc/codedeploy-agent/conf/codedeployagent.yml to 1 and restarted the service with sudo service codedeploy-agent restart.
However, under /opt/codedeploy-agent/deployment-root/ I still found 4 copies.
Is there any step I missed to restrict the number of copies kept by codedeploy-agent?
When you change the max_revisions in the CodeDeploy agent configuration, there are two steps that need to happen before old copies get cleaned up.
1) Restart the agent (which you've done)
2) Deploy a new revision
Since you haven't done a deployment yet, the old revisions are still hanging around. Once you deploy a new revision, you should see the change.
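For reference, the setting and the two steps look roughly like this (the application name, deployment group, and S3 location in the last command are placeholders):
# /etc/codedeploy-agent/conf/codedeployagent.yml
:max_revisions: 1
# restart the agent, then trigger a new deployment
sudo service codedeploy-agent restart
aws deploy create-deployment \
    --application-name MyApp \
    --deployment-group-name MyGroup \
    --s3-location bucket=my-bucket,key=app.zip,bundleType=zip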