Kubernetes liveness probes stop responding - amazon-web-services

I'm using kube-aws to set up a production Kubernetes cluster on AWS. Now I'm running into an issue I am unable to recreate in my dev environment.
When a pod is running heavy computation, which in my case happens in bursts, the liveness and readiness probes stop responding. In my local environment, where I run docker compose, they work as expected.
The probes use simple HTTP 204 No Content output using native Go functionality.
Has anyone seen this before?
EDIT:
I'd be happy to provide more information, but I am uncertain what I should provide as there is a tremendous amount of configuration and code involved. Primarily, I'm looking for help to troubleshoot and where to look to try to locate the actual issue.

Related

Open Telemetry Context Tracing for Apache and/or Nginx

We are implementing distributed tracing in our environment, starting with simple auto-instrumentation, using Open Telemetry. Our environment is primarily cloud based, running on AWS.
We have had success auto-instrumenting most of our cloud services (ECS, EKS, Lambda, etc.), and are seeing context tracing being passed from one service to the next. We are also auto-instrumenting Apache and Nginx servers running on EC2, using the Otel standard, and are successfully seeing trace information being collected, but calls from Apache to another front-end or back-end service are not being tied together by trace context. Apache produces it's own trace id and the system it calls is producing it's own as well, and the linkage is lost.
Has anybody been able to get this to work and are there samples you can share?
Thanks so much!
We have tried using the Otel libraries, as well as the AWS distributed tracing libraries, and have played around with different exporters and collectors. The tracing capabilities work individually, but when it comes time to pass context from Apache and/or Nginx to some other service, the trace link is broken

Handling long file uploads with Django on GKE

I need to be able to upload large files on a Django app hosted on GKE which can take a long time if the connection is slow. I'm relying on gunicorn to serve the application. The problem I have is: once I start the file upload, the gunicorn worker is stuck processing the upload request and cannot process anything else. As a result, kubernetes cannot access the liveness probe and restart the container, which kills the upload. It also means my site is very unresponsive.
I could add more gunicorn worker but it's not a proper solution since users can upload many files in parallel. It would just increase the threshold of what I can manage.
Scaling up would solve the responsiveness but not the fact the user needs time to upload the file.
Since this issue is usually (by usually I mean when not running on kubernetes) solved by putting a nginx sever in front of gunicorn, I was wandering how I could to this with kubernetes. I think I have two solutions:
Add an nginx sidecar in the pod. This looks simple enough and would avoids installing something cluster wide. I can also let the GKE ingress handle HTTPS traffic.
Switch ingress to use a nginx based one instead of the default GKE one. From what I saw here, this is possible. This could solve the issue for all services and would avoid wasting resources on each pod.
Do you have any feedback on this (or other solutions)? I'm still new to kubernetes and don't really understand all the implications of doing this.

504 timeout on AWS with nginx and gunicorn

I am running a python Django app on an AWS EC2 instance. It uses gunicorn and nginx to serve the app, the EC2 is behind an application load balancer. Occasionally I get 504 error where the entire EC2 instance becomes unreachable for everyone (including via SSH which I use all the time otherwise). I then need to restart everything which takes time.
I can replicate the error by overloading the app (e.g. uploading and processing a very large image), in that case, gunicorn worker times out (I see the timeout message in logs), 504 error appears and the instance becomes unreachable. I set my gunicorn to time out in 5 minutes (300 seconds) but it falls down quicker than that. There is nothing really useful in CloudWatch logs.
I am looking for ways to resolve this for all current and future cases. I.e., I want to have the situation where, if the site gets overloaded, it returns an error message as opposed to becoming completely unreachable for everyone. Is there a way to do that?
There are many things to consider and test here in order to get what is a reason for this, but I think it is OOM(out of memory) mainly because you have to restart even to login in SSH.
Nginx uses "event‑driven" approach to handle requests so a single worker of nginx can handle 1000s of req simultaneously. But Gunicorn on the other hand mostly(by default) uses sync worker which means a request will remain with a worker till it is processed.
When you put a large request your machine tries to process that request until an overflow occurs, mostly it will not get detected by any service running inside a machine. Just try to monitor memory by any monitoring tool in AWS or just SSH and use htop before calling the API.
In most cases with Django/gunicorn the culprit is oom.
Edit:
AFAIK You cannot capture(cache) an oom, the only thing you can do is aftermath i.e after system restart sees/var/logs/syslogs ... As I said monitor in AWS memory monitor(I don't have much experience with AWS).
And regarding the solution,
you first increase the memory of your EC2 until you don't get an error to see how big the problem is.
Then you optimise your application by profiling which part is actually taking this much memory. I haven't used any memory profiling so maybe you can tell me after which is better.
The only thing you can do is optimise your application see common gotchas, best practices, Query optimisations etc.
https://haydenjames.io/how-to-diagnose-oom-errors-on-linux-systems/
https://www.pluralsight.com/blog/tutorials/how-to-profile-memory-usage-in-python

How can I scale CloudFoundry applications "down" without the risk of restarting all of them?

This is a question regarding the Swisscom Application Cloud.
I have implemented a strategy to restart already deployed CloudFoundry applications without using cf restart APP_NAME. I wanted to be able to:
restart running applications without needing access the app manifest and
avoid them suffering any down-time.
The general concept looks like this:
cf scale APP_NAME -I 2
increasing the instance count of the app from 1 to 2
wait for all app instances to be running
cf restart-app-instance APP_NAME 0
restart the "old" app instance
wait for all app instances to be running again
cf scale easyasset-repower-staging -I 1
decrease the instance count of the app back from 2 to 1
This generally works and usually does what I expect it to do. The problem I am having occurs at Stage (3), where sometimes instead of just scaling the instance count back, CloudFoundry will also restart all (remaining) instances.
I do not understand:
Why does this happen only sometimes (all apps restart when scaling down)?
Shouldn't CloudFoundry keep the the remaining instances up and running?
If cf scale is not able to keep perfectly fine running app instances alive - when is it useful?
Please Note:
I am well aware of the Bluegreen / Autopilot plugins for zero-down-time deployment of applications in CloudFoundry and I am actually using them for our deployments from our build server, but they require me to provide a manifest (and additional credentials), which in this case I don't have access to (unless I can somehow extract it from a running app via cf create-app-manifest?).
Update 1:
Looking at the plugins again I found bg-restage, which apparently does approximately what I want, but I have no idea how reliable that one is.
Update 2:
I have concluded that it's probably an obscure issue (or bug) in CloudFoundry and that there are no guarantees given by cf scale that existing instances are to remain running. As pointed out above, I have since realised that it is indeed very much possible to generate the app manifest on the fly (cf create-app-manifest) and even though I couldn't use the bg-restage plugin without errors, I reverted back to the blue-green-deploy plugin, which I can now hand a freshly generated manifest to avoid this whole cf scale exercise.
Comment Questions:
Why do you have the need to restart instances of your application?
We are caching some values from persistent storage on start-up. This restart is happening when changes to that data was detected.
information about the health-check
We are using all types of health checks, depending on which app is to be re-started (http, process and port). I have observed this issue only for apps with health checkhttp. I also have ahttp-endpoint` defined for the health check.
Are you trying to alter the memory with cf scale as well?
No, I am trying to keep all app configuration the same during this process.
When you have two running instances, the command
cf scale <APP> -i 1
will kill instance #1 and instance #0 will not be affected.

Can't rerun meteor leaderboard on AWS EC2 micro T1 instance after failing keepalive

I'm unable to run a Meteor leaderboard demo after a failed keepalive error on an AWS EC2 micro.T1 instance. If I start from a freshly booted Amazon Machine Instance (AMI) I'm able to run the leaderboard demo at localhost:3000 from Firefox when I'm connected with a VNC client (TightNVC Viewer). It runs very, very slowly, but it runs.
If I fail to interact with it soon enough however I get these messages
I2051-00:03:03.173(0)?Failed to receive keepalive! Exiting.
=> Exited with code:1
=> Meteor server restarted
From that point forward everything on that instance runs at a glacial pace. Switching back to the Firefox window takes 3 minutes. when I try to connect to //localhost:3000 Firefox I usually get a message about a script no longer running and eventually the terminal window adds this to what I wrote above:
I2051-00:06:02.443(0)?Failed to receive keepalive! Exiting.
=> Exited with code:1
=> Meteor server restarted
I2051-00:08:17.227(0)?Failed to receive keepalive! Exiting.
=> Exited with code:1
=> Your application is crashing. Waiting for file change.
Can anyone translate for me what is happening?
I'm wondering whether the t1.micro instance I'm running is just too under-powered or because it's not shutting down meteor properly thereby leaving an instance of MongoDB running and trying to launch another.
I'm using Amazon Machine Image ubuntu-precise-12.04-amd64-server-20130411.1 (ami-70f96e40) which says this about it's configuration:
Size: t1.micro
ECUs: up to 2
vCPUs: 1
Memory (GiB): 0.613
Instance Storage (GiB): EBS only
EBS-Optimized Available: -
Netw. Performance: -Very Low
Micro instances
Micro instances are a low-cost instance option, providing a small amount of CPU resources. They are suited for lower throughput applications, and websites that require additional compute cycles periodically, but are not appropriate for applications that require sustained CPU performance. Popular uses for micro instances include low traffic websites or blogs, small administrative applications, bastion hosts, and free trials to explore EC2 functionality.
If my guess is right, can anyone suggest an AMI suitable for Meteor development?
Thanks
check this answer
Try to remove meteor remove autopublish
How are you running the app on ec2? I have been able to run apps on a micro instance so I don't see why this should be an issue.
If you are running it by using 'meteor' as you would locally that's probably the issue. You get way better performance when running it as a node app, this typically isn't an issue when developing locally but may be too much for a ec2 micro.
What you want to do is 'meteor bundle example.tgz', upload that to the server and run it as a node app.
Here is a guide that I remember using a while ago to get it done on ec2:
http://julien-c.fr/2012/10/meteor-amazon-ec2/
You shouldn't need to use VNC either, you can access it from your own computer in a browser using the public address your instance gets assigned.
If you get a node fibers error message which is pretty common then cd into bundle/program/server do 'npm uninstall fibers' and then 'npm install fibers'