I have a flask app that includes some NLP packages and takes a while to initially build some vectors before it starts the server. I've noticed this in the past with Google App Engine and I was able to set a max timeout in the app.yaml file to fix this.
The problem is that when I start my cluster on Kubernetes with this app, the workers keep timing out in the logs, which makes sense because I'm sure the default timeout is not long enough. However, I can't figure out how to configure GKE to give the workers enough time to do everything they need to do before they start serving.
How do I increase the time the workers can take before they timeout?
I deleted the old instances so I can't get the logs right now, but I can start it up if someone wants to see the logs.
It's something like this:
I 2020-06-26T01:16:04.603060653Z Computing vectors for all products
E 2020-06-26T01:16:05.660331982Z 95it [00:05, 17.84it/s]
[2020-06-26 01:16:05 +0000] [220] [INFO] Booting worker with pid: 220
E 2020-06-26T01:16:31.198002748Z [nltk_data] Downloading package stopwords to /root/nltk_data...
E 2020-06-26T01:16:31.198056691Z [nltk_data] Package stopwords is already up-to-date!
E 2020-06-26T01:16:35.696015992Z [CRITICAL] WORKER TIMEOUT (pid:220)
E 2020-06-26T01:16:35.696015992Z [2020-06-26 01:16:35 +0000] [220] [INFO] Worker exiting (pid: 220)
I also see this:
The node was low on resource: memory. Container thoughtful-sha256-1 was using 1035416Ki, which exceeds its request of 0.
Obviously I don't exactly know what I'm doing. Why does it say I'm requesting 0 memory and can I set a timeout amount for the Kubernetes nodes?
Thanks for the help!
One thing you can do is add some sort of delay in a startup script for your GCP instances. You could try a simple:
#!/bin/bash
sleep <time-in-seconds>
Another thing you can try is adding some sort of delay to when your containers start in your Kubernetes nodes. For example, a delay in an initContainer:
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
  labels:
    app: myapp
spec:
  containers:
  - name: myapp-container
    image: myapp:latest
  initContainers:
  - name: init-myservice
    image: busybox:1.28
    command: ['sh', '-c', "echo Waiting a bit && sleep 3600"]
Furthermore, you can try a startupProbe combined with the probe parameter initialDelaySeconds on your actual application container, so that Kubernetes waits for some time before it even starts checking whether the application is up:
startupProbe:
  exec:
    command:
    - touch
    - /tmp/started
  initialDelaySeconds: 3600
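On the separate memory message: it says the container "exceeds its request of 0" because no resource requests were set on the container, so the scheduler assumes it needs nothing and evicts it first when the node runs low. A minimal sketch of explicit requests/limits (the values here are illustrative; size them for your own vectors):

```yaml
containers:
- name: myapp-container
  image: myapp:latest   # illustrative image name
  resources:
    requests:
      memory: "1.5Gi"   # headroom above the ~1GiB usage seen in the eviction message
    limits:
      memory: "2Gi"
```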
Related
I have a Django application deployed on a K8s cluster. I need to send some emails (some are scheduled, others should be sent asynchronously), and the idea was to delegate those emails to Celery.
So I set up a Redis server (with Sentinel) on the cluster, and deployed an instance for a Celery worker and another for Celery beat.
The k8s object used to deploy the Celery worker is pretty similar to the one used for the Django application. The main difference is the command introduced on the celery worker: ['celery', '-A', 'saleor', 'worker', '-l', 'INFO']
Scheduled emails are sent with no problem (celery worker and celery beat don't have any problems connecting to the Redis server). However, the asynchronous emails - "delegated" by the Django application - are not sent because it is not possible to connect to the Redis server (ERROR celery.backends.redis Connection to Redis lost: Retry (1/20) in 1.00 second. [PID:7:uWSGIWorker1Core0])
Error 1:
socket.gaierror: [Errno -5] No address associated with hostname
Error 2:
redis.exceptions.ConnectionError: Error -5 connecting to redis:6379. No address associated with hostname.
The Redis server, Celery worker, and Celery beat are in a "redis" namespace, while the other things, including the Django app, are in the "development" namespace.
Here are the variables that I define:
- name: CELERY_PASSWORD
  valueFrom:
    secretKeyRef:
      name: redis-password
      key: redis_password
- name: CELERY_BROKER_URL
  value: redis://:$(CELERY_PASSWORD)@redis:6379/1
- name: CELERY_RESULT_BACKEND
  value: redis://:$(CELERY_PASSWORD)@redis:6379/1
I also tried to define CELERY_BACKEND_URL (with the same value as CELERY_RESULT_BACKEND), but it made no difference.
What could be the cause for not connecting to the Redis server? Am I missing any variables? Could it be because pods are in a different namespace?
Thanks!
Solution from @sofia that helped to fix this issue:
You need to use the same namespace for the Redis server and for the Django application. In this particular case, change the namespace "redis" to "development" where the application is deployed.
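As an aside (this is standard Kubernetes DNS behavior, not part of the accepted fix): if you do want to keep Redis in its own namespace, pods in other namespaces can usually reach it via the service's fully qualified name instead of the short name:

```yaml
# service "redis" in namespace "redis", reached from the "development" namespace
- name: CELERY_BROKER_URL
  value: redis://:$(CELERY_PASSWORD)@redis.redis.svc.cluster.local:6379/1
```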
I have an AWS Elastic Beanstalk environment with an old WSGI setting (given below). I have no idea how this works internally; can anybody guide me?
NumProcesses: 7 -- number of processes
NumThreads: 5 -- number of threads in each process
How are memory and CPU used with this configuration, given that there are no memory and CPU settings at the Elastic Beanstalk level?
These parameters are part of the configuration options for the Python platform, in the aws:elasticbeanstalk:container:python namespace.
They mean (from docs):
NumProcesses: The number of daemon processes that should be started for the process group when running WSGI applications (default value 1).
NumThreads: The number of threads to be created to handle requests in each daemon process within the process group when running WSGI applications (default value 15).
Internally, these values map to uwsgi or gunicorn configuration options in your EB environment. For example:
uwsgi --http :8000 --wsgi-file application.py --master --processes 4 --threads 2
Their impact on memory and cpu usage of your instance(s) is based on your application and how resource intensive it is. If you are not sure how to set them up, maybe keeping them at default values would be a good start.
The settings are also available in the EB console, under Software category:
To add on to @Marcin
Amazon linux 2 uses gunicorn
workers are processes in gunicorn
Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second.
Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally, we (gunicorn creators) recommend (2 x $num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
To see how the settings in the option settings map to gunicorn you can ssh into your eb instance, go
$ eb ssh
$ cd /var/app/current/
$ cat Procfile
web: gunicorn --bind 127.0.0.1:8000 --workers=3 --threads=20 api.wsgi:application
--threads
A positive integer generally in the 2-4 x $(NUM_CORES) range. You’ll want to vary this a bit to find the best for your particular application’s work load.
The threads option only applies to the gthread worker type. Gunicorn's default worker class is sync. If you use the sync worker type and set the threads setting to more than 1, the gthread worker type will be used instead automatically.
Based on all the above, I would personally choose:
workers = (2 x $NUM_CORES) + 1
threads = 4 x $NUM_CORES
For a t3.medium instance, which has 2 cores, that translates to:
workers = 5
threads = 8
Obviously you need to tweak this for your use case and treat these as defaults that could very well be wrong for your particular application; read the refs below to see how to choose the right setup for your use case.
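The rule of thumb above can be sketched as a small gunicorn config file that derives the numbers from whatever machine it runs on (a sketch, not Elastic Beanstalk's own mechanism; on EB you would still put the resulting values in the Procfile or option settings):

```python
# gunicorn_conf.py - derive workers/threads from the host's core count
import multiprocessing

cores = multiprocessing.cpu_count()
workers = (2 * cores) + 1   # gunicorn's (2 x $NUM_CORES) + 1 guideline
threads = 4 * cores         # upper end of the 2-4 x $NUM_CORES range
worker_class = "gthread"    # threads > 1 only takes effect with a threaded worker

print(f"workers={workers} threads={threads}")
```

You could then start the server with `gunicorn -c gunicorn_conf.py api.wsgi:application` (the module path here is taken from the Procfile shown above).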
References:
REF: Gunicorn Workers and Threads
REF: https://medium.com/building-the-system/gunicorn-3-means-of-concurrency-efbb547674b7
REF: https://docs.gunicorn.org/en/stable/settings.html#worker-class
I am working on Django Channels and hitting problems deploying them on the Google App Engine flexible environment. First I was getting the error 'deployment has failed to become healthy in the allotted time' and resolved it by adding readiness_check to app.yaml; now I am getting the error below:
(gcloud.app.deploy) Operation [apps/socketapp-263709/operations/65c25731-1e5a-4aa1-83e1-34955ec48c98] timed out. This operation may still be underway.
App.yaml
runtime: python
env: flex
runtime_config:
  python_version: 3
instance_class: F4_HIGHMEM
handlers:
# This configures Google App Engine to serve the files in the app's
# static directory.
- url: /static
  static_dir: static/
- url: /.*
  script: auto
# [END django_app]
readiness_check:
  check_interval_sec: 120
  timeout_sec: 40
  failure_threshold: 5
  success_threshold: 5
  app_start_timeout_sec: 1500
How can I fix this issue? Any suggestions?
This error can be due to several issues:
1) Your app.yaml file isn't configured correctly. In App Engine flexible, resources are not requested through the instance_class option (that's for the standard environment); you have to use the resources option, like the following:
resources:
  cpu: 2
  memory_gb: 2.3
  disk_size_gb: 10
  volumes:
  - name: ramdisk1
    volume_type: tmpfs
    size_gb: 0.5
2) You're missing an entrypoint for your app. To deploy Django Channels, the docs suggest having an entrypoint for the Daphne server. Add the following to your app.yaml:
entrypoint: daphne -b 0.0.0.0 -p 8080 mysite.asgi:application
3) After doing the previous, if you still get the same error, it's possible that your In-use IP addresses quota in the region of your App Engine Flexible application has reached its limit.
To check this issue, you can go to the "Activity" tab of your project home page. There, you can see warnings about quota limits and VMs failing to be created.
By default, App Engine keeps the previous versions of your app, and if they are still running they may consume IP addresses. You can delete the previous versions and/or request an increase of your IP address quota limit.
Also update gcloud tools and SDK which may resolve the issue.
To check your in-use addresses, open the IP addresses quota page in the Cloud Console, where you can increase your quota with the 'Edit Quotas' button.
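Putting points 1) and 2) together, the relevant part of a corrected app.yaml might look like this (a sketch; the resource values are illustrative and mysite.asgi is taken from the entrypoint above):

```yaml
runtime: python
env: flex
entrypoint: daphne -b 0.0.0.0 -p 8080 mysite.asgi:application
runtime_config:
  python_version: 3
resources:
  cpu: 2
  memory_gb: 2.3
  disk_size_gb: 10
```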
I'm new at a company that has this project on Google Cloud Platform, which I've never used. On my first day I got this 502 Bad Gateway error. Looking at the logs on Google Cloud Platform, I found the following:
[error] 33#33: *285 upstream prematurely closed connection while reading response header from upstream, client: 172.217.172.212, server: , request: "POST /product/fast_appraisal/result/ HTTP/1.1", upstream: "http://172.17.0.1:8080/product/fast_appraisal/result/", host: "avalieidjango.appspot.com", referrer: "https://avalieidjango.appspot.com/product/fast_appraisal/search/"
I've tried to edit the app.yaml adding --timeout and --graceful-timeout parameters to it like the following:
# [START runtime]
runtime: python
env: flex
entrypoint: gunicorn -b :$PORT mysite.wsgi --timeout=90 --graceful-timeout=10
beta_settings:
  cloud_sql_instances: avalieidjango:southamerica-east1:avaliei
runtime_config:
  python_version: 3
handlers:
- url: /.*
  script: manage.py
  secure: always
  redirect_http_response_code: 301
# [END runtime]
In the settings.py file, the DEBUG variable is set to False.
Looking for answers on the internet, I found a few cases, but none looks exactly like mine.
Locally I'm running the project on Windows 7, so the error only occurs when I deploy to GCP. I'm new to GCP and gunicorn.
Edit
Over the past few days I've been through a lot of forums, and added a few new configurations to my app.yaml, trying to work with threads and workers to solve the problem.
The entrypoint line looks like this:
entrypoint: gunicorn -b :$PORT --worker-class=gevent --worker-connections=1000 --workers=3 mysite.wsgi --timeout 90
This project consists in search a Postgres database on GCP gathering information about properties and running an AI to show some predictions about its values.
I've tried threads and processes, but even with just my own requests the application is too slow; even a simple page takes some time to render.
Local tests run better, but in production it isn't working at all.
The AI wasn't developed for me and it uses a large joblib file.
The project doesn't use containers like Docker. Maybe it could help in some way if I "dockerize" the project?
I stopped seeing this error after changing the CONN_MAX_AGE value to None, which keeps the database connection lifetime unlimited. However, this may cause some security issues that must be evaluated before deploying your application. If you change it, keep an eye on the Google Cloud logs, looking for strange connection attempts.
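For reference, CONN_MAX_AGE is set per database in Django's settings.py (a sketch; the engine and database name are placeholders). None means connections persist indefinitely, while the default of 0 closes the connection at the end of each request:

```python
# settings.py (sketch) - persistent database connections
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydb",        # placeholder database name
        "CONN_MAX_AGE": None,  # None = keep connections open indefinitely
    }
}
```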
What is the ideal Ansible way to do an Apache graceful restart?
- name: Restart Apache gracefully
  command: apachectl -k graceful
Does the Ansible systemd module do the same? If not, what is the difference? Thanks!
- name: Restart apache service.
  systemd:
    name: apache2
    daemon_reload: yes
    state: restarted
What you can do with Ansible is to ensure that all established connections to Apache are closed (drained in Ansible lingo).
Use the wait_for module with the condition to wait for drained connections on the particular host and port, with the state set to drained. See below:
- name: wait until apache2 connections are drained.
  wait_for:
    host: 0.0.0.0
    port: 80
    state: drained
Note: You can use this for all your Linux network services, which becomes very handy if you want to shut down services in a particular order in your Ansible playbook.
The wait_for directive is useful for ensuring that Ansible does not proceed with your playbook until specific steps are completed.
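As a sketch of that ordering idea (the service name is illustrative), you can drain a service before stopping it:

```yaml
- name: wait until apache2 connections are drained
  wait_for:
    host: 0.0.0.0
    port: 80
    state: drained

- name: stop apache2 only after existing connections have finished
  service:
    name: apache2
    state: stopped
```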
There is no support for a graceful state in the service or systemd modules at the moment, because it is quite specific to certain services; state is limited to started, stopped, restarted, and reloaded.
So for now you need to use the command module, as you wrote in the question, to perform a graceful restart; this is the only proper solution.
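A sketch of how the command module can fit into a playbook with a handler, so the graceful restart only fires when the configuration actually changes (the paths and template name are illustrative):

```yaml
- hosts: webservers
  become: yes
  tasks:
    - name: Deploy Apache vhost config
      template:
        src: vhost.conf.j2
        dest: /etc/httpd/conf.d/vhost.conf
      notify: Graceful restart apache

  handlers:
    - name: Graceful restart apache
      command: apachectl -k graceful
```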
However, there is an open issue to support custom states, so perhaps someone will implement that soon.
The documentation for the Ansible service module does not clearly state what the "reloaded" state does, but I found that on a standard Red Hat 7 install, using the service module's "reloaded" state results in a graceful restart.
I was led to this solution by this Server Fault Q&A.
You can verify this by getting a process list of the httpd processes prior to running the playbook that triggers your handler.
ps -ef | grep httpd | grep -v grep
After your playbook runs and the handler's reloaded state for the httpd service shows "changed", re-examine the process list.
You should see that the start times for all the child httpd (non-root) processes have updated, while the root-owned parent process's start time has stayed the same.
If you also look in the error log you should see an entry containing:
"... configured -- resuming normal operations ... "
And finally, you can see this by examining the output of systemctl status for httpd.service and seeing that the apachectl graceful option was called:
sudo systemctl status httpd.service
My handler now looks like:
- name: "{{ service_name }} restart handler"
  become: yes
  ansible.builtin.service:
    name: "{{ service_name }}"
    # state: restarted
    state: reloaded