Performance issues: AWS - Docker - Django

I have an application deployed with Docker on an EC2 instance (t3a.xlarge).
My application uses 7 different containers (cf. image docker-ps.png):
A Django app, serving as an API (using Python 3.6)
An Angular application (Angular 2+)
A memcached server
A certbot (using Let's Encrypt to automatically renew my SSL certificates)
An Nginx, used as a reverse proxy to serve my Angular application and my Django API
A Postgres database
A pgAdmin, in order to manage my database
The issues happen when we send a push notification to our users through Firebase (around 42,000 users). The API stops responding for anywhere from 1 to 6 minutes.
The Django API uses the Gunicorn web server (https://gunicorn.org/) with this configuration:
gunicorn xxxx_api.wsgi -b 0.0.0.0:80 --max-requests 500 --max-requests-jitter 50 --enable-stdio-inheritance -k gevent --workers=16 -t 80
Neither the server nor the containers have ever crashed. When I check the metrics, we never use more than 60% of the CPU. Here is a screenshot of some metrics from when the notification was sent: https://ibb.co/Mc0v7R1
Are we using more bandwidth than our instance allows? Or should I use another AWS service?

Memory utilisation metrics are not captured for EC2 instances, since OS-level metrics are not available to AWS. You can collect custom metrics yourself.
Reference:
https://awscloudengineer.com/create-custom-cloudwatch-metrics-centos-7/

I think your problem is one of design: you could try sending your push notifications through an async queue, using something like SNS & SQS (the AWS way) or Celery & Redis (the traditional way).
If you choose the traditional way, this post could help you:
https://blog.devartis.com/sending-real-time-push-notifications-with-django-celery-and-redis-829c7f2a714f
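For illustration, here is a minimal sketch of the Celery & Redis route, assuming a Redis broker on localhost and Firebase's legacy HTTP endpoint; the app name, server key and task are placeholders, not taken from the question.

# tasks.py - a minimal sketch, not the asker's code.
import requests
from celery import Celery

app = Celery("notifications", broker="redis://localhost:6379/0")

FCM_URL = "https://fcm.googleapis.com/fcm/send"  # legacy FCM HTTP API
FCM_SERVER_KEY = "your-server-key"               # placeholder

@app.task(bind=True, max_retries=3, default_retry_delay=30)
def send_push(self, device_token, title, body):
    """Send one notification; Celery retries transient failures."""
    try:
        resp = requests.post(
            FCM_URL,
            json={"to": device_token,
                  "notification": {"title": title, "body": body}},
            headers={"Authorization": "key=" + FCM_SERVER_KEY},
            timeout=10,
        )
        resp.raise_for_status()
    except requests.RequestException as exc:
        raise self.retry(exc=exc)

The API view would then loop over the device tokens calling send_push.delay(token, title, body) and return immediately; the 42,000 sends drain through the queue instead of tying up Gunicorn workers.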

I think it's because of queued HTTP requests to Firebase. I believe you are sending 42,000 Firebase requests in a loop. I/O calls are blocking in nature: if you are running the Django app single-threaded under Gunicorn, those 42,000 HTTP calls will block new calls until they are finished. They will sit in the queue for as long as the connection stays alive and the requests stay within Nginx's thresholds. I don't think 42,000 push notifications will exhaust memory or CPU unless the payload is very large.
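To make the blocking concrete, here is a sketch (the endpoint and token list are placeholders, not from the question): at even 10 ms per round-trip, 42,000 sequential calls inside one request handler take about 7 minutes, on the same order as the stalls described above. Overlapping the I/O bounds the stall, but the real fix is moving the loop out of the request/response cycle entirely.

from concurrent.futures import ThreadPoolExecutor
import requests

tokens = ["device-token-1", "device-token-2"]  # stand-in for 42,000 tokens

def send_one(token):
    # One blocking round-trip to Firebase per device token.
    requests.post("https://fcm.googleapis.com/fcm/send",
                  json={"to": token}, timeout=10)

# Sequential version: for token in tokens: send_one(token)  <- minutes of blocking
# Thread pool version: overlaps the network waits across 50 workers.
with ThreadPoolExecutor(max_workers=50) as pool:
    pool.map(send_one, tokens)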

Related

Running Flask port 80 on Elastic-Beanstalk Worker

Given an AWS Elastic-Beanstalk Worker box, is it possible to use Flask/port:80 to serve the messages coming in from the associated SQS queue?
I have seen conflicting information about what goes on inside an ELB-worker. The ELB Worker Environment page says:
Elastic Beanstalk simplifies this process by managing the Amazon SQS queue and running a daemon process on each instance that reads from the queue for you. When the daemon pulls an item from the queue, it sends an HTTP POST request locally to http://localhost/ on port 80 with the contents of the queue message in the body. All that your application needs to do is perform the long-running task in response to the POST.
This SO question Differences in Web-server versus Worker says:
The most important difference in my opinion is that worker tier instances do not run web server processes (apache, nginx, etc).
Based on this, I would have expected that I could just run a Flask server on port 80 and have it handle the SQS messages. However, the post appears to be incorrect: even the ELB-worker boxes have Apache running on them, apparently for doing health-checks (when I stopped it, my server turned red). And of course it's using port 80...
I already have Flask/Gunicorn on an EC2 server that I was trying to move to ELB, and I would like to keep using that - is it possible? (Note: the queue-daemon only posts messages to port 80, and that can't be changed...)
The docs aren't clear, but it sounds like they expect you to modify Apache to proxy to Flask, maybe? I hope that's not the only way.
Or, what is the "correct" way of setting up an ELB-worker to process the SQS messages? How are you supposed to "perform the long-running task"?
Note: now that I've used ELB more, and have a fairly good understanding of it, let me make clear that this is not the use-case Amazon designed ELB-workers for, and it has some glitches (which will be noted). The standard use-case, basically, is that you create a simple Flask app and hook it into an ELB-EC2 server that is configured to make running that Flask app simple.
My use-case was different: I already had an EC2 server with a large Flask app running under Gunicorn, as well as various other things going on. I wanted to use that server (as an image) to build the ELB server and have it respond to SQS-queue messages. There may be better solutions, like just writing a queue-polling daemon, and no-one else may ever take this option, but there it is...
The ELB worker is connected to an SQS queue by a daemon that listens on that queue and (internally) POSTs any messages to http://localhost:80. Apache is listening on port 80 to handle the health-checks done by the ELB manager (or something in the eco-system). Apache passes non-health-check requests, via mod_wsgi, to the Flask app that was uploaded, which lives at:
/opt/python/current/app/application.py
I suspect it would be possible, but difficult, to remove Apache and handle the health-checks some other way (in Flask), thus freeing up port 80. But that's enough of a change that I decided it wasn't worth it.
So the solution I found is to change which port the local daemon posts to: by reconfiguring it via a YAML config-file, it will post to port 5001, where my Flask app is running. This means Apache can continue to handle the health-checks on port 80, and Flask can handle the SQS messages from the daemon.
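For concreteness, here is a minimal sketch of the Flask side under that setup; do_long_running_task is a hypothetical stand-in for the real work.

# application.py - minimal sketch of the app the daemon posts to on 5001.
from flask import Flask, request

application = Flask(__name__)

def do_long_running_task(message):
    print("processing:", message)  # hypothetical placeholder

@application.route("/", methods=["POST"])
def handle_sqs_message():
    # The daemon delivers the SQS message body as the POST body.
    message = request.get_data(as_text=True)
    do_long_running_task(message)
    # A 200 tells the daemon the message was handled, so it is deleted
    # from the queue; errors or timeouts cause the message to be retried.
    return "", 200

if __name__ == "__main__":
    application.run(host="127.0.0.1", port=5001)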
You configure the daemon via the first file below, and stop/start it (as root) with the scripts that follow; an illustrative config fragment comes after the list:
/etc/aws-sqsd.d/default.yaml
/opt/elasticbeanstalk/addons/sqsd/hooks/stop-sqsd.sh
/opt/elasticbeanstalk/addons/sqsd/hooks/start-sqsd.sh
/opt/elasticbeanstalk/addons/sqsd/hooks/restart-sqsd.sh
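The fragment below shows the shape of the change; the exact key names are an assumption on my part, so match them against your generated file.

# /etc/aws-sqsd.d/default.yaml - illustrative fragment only
---
http_port: 5001   # assumed key name; was 80, now points at the Flask app
http_path: /      # path the daemon POSTs message bodies to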
Actual daemon:
/opt/elasticbeanstalk/lib/ruby/bin/aws-sqsd
/opt/elasticbeanstalk/lib/ruby/lib/ruby/gems/2.2.0/gems/aws-sqsd-2.3/bin/aws-sqsd
Glitches:
If you ever use the ELB GUI to configure daemon options, it will overwrite the config-file, and you will have to re-edit the port (and restart the daemon).
Note: all of the HTTP traffic is internal, either to the ELB eco-system or the worker, so it is possible to close off all external ports (I keep 22 open), including port 80. Otherwise your worker has Apache answering external requests on port 80, meaning it's open to the world. I assume the server is configured fairly securely, but port 80 doesn't need to be open at all, for health-checks or anything else.

504 gateway timeout for any requests to Nginx with lots of free resources

We maintain an internal project that has both a web and a mobile application platform. The backend is developed in Django 1.9 (Python 3.4) and deployed on AWS.
The server stack consists of Nginx, Gunicorn, Django and PostgreSQL. We use a Redis-based cache server to serve resource-intensive queries. Our AWS resources include:
t1.medium EC2 (2 core, 4 GB RAM)
PostgreSQL RDS with one additional read-replica.
Right now Gunicorn is set to create 5 workers (following the 2*n+1 rule). Load-wise, there are 20-30 mobile users making requests every minute and 5-10 users checking the web panel every hour. So I would say: not very much load.
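(For reference, a sketch of that sizing as a gunicorn.conf.py; the bind address and timeout are assumptions, not taken from this setup.)

# gunicorn.conf.py - sketch of the 2*n+1 sizing described above
import multiprocessing

workers = multiprocessing.cpu_count() * 2 + 1  # 2 cores -> 5 workers
timeout = 30             # assumed: seconds before a silent worker is restarted
bind = "127.0.0.1:8000"  # assumed: the address Nginx proxies to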
Now, this setup works fine 80% of the time. But when something goes wrong (for example, we detect a bug in the live system and have to take the server down for maintenance for a few hours; in the meantime, the mobile apps queue up requests, so when we bring the backend back up, a lot of users hit the system at the same time), the server stops behaving normally and starts responding with 504 gateway timeout errors.
Surprisingly, every time this has happened, we found the server resources (CPU, memory) to be 70-80% free, and the connection pools in the databases were mostly idle.
Any idea where the problem is? How should we debug it? If you have already faced a similar issue, please share the fix.
Thank you,

Django+celery: how to show worker logs on the web server

I'm using Django+Celery for my first-ever web development project, with RabbitMQ as the broker. My Celery workers run on a different system from the web server and execute long-running tasks. During task execution, the task output is dumped to local log files on the workers. I'd like to display these task log files through the web server so the user can see in real time where the execution stands, but I have no idea how to transfer these log files between the workers and the system running the web server. Any suggestion is appreciated.
Do not move the logs; just log to the same place. That can be any database (relational or non-relational) accessible from both the web server and the Celery workers. You can even write (or look for) an appropriate Python logging handler that saves log records to the centralized storage.
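For example, here is a minimal sketch of such a handler, assuming a Redis instance reachable from both sides; the key scheme is illustrative only.

import logging
import redis

class RedisLogHandler(logging.Handler):
    """Push formatted log records onto a Redis list."""

    def __init__(self, host="localhost", port=6379, key="task-logs"):
        super().__init__()
        self.client = redis.Redis(host=host, port=port)
        self.key = key

    def emit(self, record):
        try:
            self.client.rpush(self.key, self.format(record))
        except Exception:
            self.handleError(record)

# On a worker: route task output to Redis instead of a local file.
logger = logging.getLogger("tasks")
logger.addHandler(RedisLogHandler(key="task-logs:job-42"))

# On the web server: read the same list to display progress, e.g.
# redis.Redis().lrange("task-logs:job-42", 0, -1)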
Maybe the solution isn't to move the logs but to aggregate them. Take a look at logging tools like Splunk, Loggly or Logscape.

Django on Apache - Prevent 504 Gateway Timeout

I have a Django server running on Apache via mod_wsgi. I have a massive background task, kicked off via an API call, that searches emails and generally takes a few hours to run.
To facilitate debugging - since exceptions and everything else happen in the background - I created an API call that runs the task blocking. So the browser actually blocks for those hours and then receives the results.
On localhost this is fine. However, in the real Apache environment, after about 30 minutes I get a 504 Gateway Timeout error.
How do I change the settings so that Apache allows - just during this debug phase - the HTTP request to block for a few hours without returning a 504 Gateway Timeout?
I'm assuming this can be changed in the Apache configuration.
You should not be running long tasks within Apache processes, nor even waiting on them. Use a background task queueing system such as Celery to run them. Have the web request return as soon as the job is queued, and implement some sort of polling mechanism as necessary to check whether the job is complete and the results can be obtained.
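As an illustration of that pattern, here is a sketch using Celery with Django views, assuming a result backend is configured; the task and view names are placeholders.

from celery.result import AsyncResult
from django.http import JsonResponse

from .tasks import search_emails  # hypothetical Celery task

def start_search(request):
    # Enqueue the hours-long job and return immediately.
    result = search_emails.delay()
    return JsonResponse({"task_id": result.id})

def search_status(request, task_id):
    # The client polls this endpoint until the job finishes.
    result = AsyncResult(task_id)
    payload = {"state": result.state}
    if result.successful():
        payload["result"] = result.get(timeout=1)
    return JsonResponse(payload)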
Also, are you sure the 504 isn't coming from some front-end proxy (explicit or transparent) or load balancer? Apache has no default timeout of 30 minutes.

Django and Celery Confusion

After reading a lot of blog posts, I decided to switch from crontab to Celery for my middle-scale Django project. There are a few things I don't understand:
1- I'm planning to start a micro EC2 instance dedicated to RabbitMQ. Would this be sufficient for small-to-medium task loads (such as dispatching periodic e-mails via Amazon SES)?
2- Where does the computation of tasks occur: on the Django server or on the RabbitMQ server (assuming RabbitMQ is on a separate server)?
3- When I need to grow my system and have 2 or more application servers behind a load balancer, do these Celery machines need to connect to the same RabbitMQ vhost? Assume the application servers are carbon copies of each other, the tasks are the same, and everything is in sync at the database level.
I don't know the exact answer to the sizing question, but you can definitely configure the worker to suit (e.g. use -c 1 for a single-process worker to avoid using much memory, or the eventlet/gevent pools); see also the --autoscale option. The choice of broker transport also matters here: the ones that don't poll are more CPU-efficient (RabbitMQ/Redis/Beanstalk).
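For illustration, those options map onto worker invocations like the following; the app name proj is a placeholder.

celery -A proj worker -c 1               # single worker process, minimal memory
celery -A proj worker -P gevent -c 100   # gevent pool for I/O-bound tasks
celery -A proj worker --autoscale=10,1   # scale between 1 and 10 processes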
Computation happens on the workers; the broker is only responsible for accepting, routing and delivering messages (and persisting messages to disk when necessary).
To add additional workers, these should indeed connect to the same virtual host. You would only use separate virtual hosts if you wanted applications to have separate message buses.