Django REST Framework slow on Gunicorn - django

I deployed a new application made on Django REST Framework running with Gunicorn.
The application is deployed on 4 different servers and they are listening on a port (8001) which is consumed by an HAProxy load balancer.
Unfortunately I am suffering many performance problems: many requests take seconds to be served, sometimes even 30 or 60 seconds.
Sometimes even the basic entry endpoint (like https://my.api/api/v2, which basically returns the list of available endpoints) takes 10-20 seconds to respond.
I don't think the problem is the load balancer, because I see the same latencies when I connect directly to any of the backend servers with my client, bypassing the load balancer.
And I don't think the problem is the database, because calling https://my.api/api/v2 as a guest (not logged in with any user) shouldn't trigger any database query.
This is a performance test made with hey (https://github.com/rakyll/hey) on the basic endpoint without authorization:
me@staging2:~$ hey -n 10000 -c 100 https://my.api/api/v2/
Summary:
Total: 38.9165 secs
Slowest: 18.6772 secs
Fastest: 0.0041 secs
Average: 0.3099 secs
Requests/sec: 256.9604
Total data: 20870000 bytes
Size/request: 2087 bytes
Response time histogram:
0.004 [1] |
1.871 [9723] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
3.739 [0] |
5.606 [175] |■
7.473 [0] |
9.341 [35] |
11.208 [29] |
13.075 [0] |
14.943 [0] |
16.810 [0] |
18.677 [37] |
Latency distribution:
10% in 0.0054 secs
25% in 0.0122 secs
50% in 0.0322 secs
75% in 0.2378 secs
90% in 0.3417 secs
95% in 0.3885 secs
99% in 8.5935 secs
Details (average, fastest, slowest):
DNS+dialup: 0.0021 secs, 0.0041 secs, 18.6772 secs
DNS-lookup: 0.0001 secs, 0.0000 secs, 0.0123 secs
req write: 0.0001 secs, 0.0000 secs, 0.0153 secs
resp wait: 0.3075 secs, 0.0039 secs, 18.6770 secs
resp read: 0.0001 secs, 0.0000 secs, 0.0150 secs
Status code distribution:
[200] 10000 responses
This is my Gunicorn configuration:
bind = '0.0.0.0:8001'
backlog = 2048
workers = 1
worker_class = 'sync'
worker_connections = 1000
timeout = 120
keepalive = 5
spew = False
daemon = False
pidfile = None
umask = 0
user = None
group = None
tmp_upload_dir = None
errorlog = '-'
loglevel = 'debug'
accesslog = '-'
access_log_format = '%(h)s %(l)s %(u)s %(t)s "%(r)s" %(s)s %(b)s "%(f)s" "%(a)s"'
And Gunicorn is currently running with the following command:
/path/to/application/bin/python3 /path/to/application/bin/gunicorn --env DJANGO_SETTINGS_MODULE=settings.production -c /path/to/application/settings/gunicorn_conf.py --user=deployer --log-file=/path/to/application/django-application-gunicorn.log --chdir /path/to/application/django-application --daemon wsgi:application
What tests can I do to find out what is causing my performance problems?
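One quick test is to raise the worker count on a single backend and re-run the same hey command against it directly. Below is a sketch of such a variant of the gunicorn_conf.py above; the worker and thread counts are illustrative assumptions, not tuned values:
# gunicorn_conf.py -- experimental variant for one backend; values are assumptions for a test run
import multiprocessing

bind = '0.0.0.0:8001'
backlog = 2048

# With workers = 1 and the sync worker class, a single slow request blocks every other
# request on that backend, so the 100 concurrent hey clients queue up behind it.
workers = multiprocessing.cpu_count() * 2 + 1   # common rule of thumb
worker_class = 'gthread'                        # threaded workers as an alternative to plain sync
threads = 4                                     # concurrent requests handled per worker

timeout = 120
keepalive = 5
If the 99th-percentile latency collapses with more workers, the tail latencies most likely come from worker starvation; if it does not, look elsewhere (TLS handshakes, middleware, or the debug-level logging currently enabled).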

Related

Increasing the Gunicorn timeout doesn't take effect

I have a minimal Flask API deployed on DigitalOcean which throws this error when I make a large request:
[1] [CRITICAL] WORKER TIMEOUT (pid:16)
I tried:
/env/Lib/site-packages/gunicorn-config.py (0 is unlimited, but I tried 500 etc.)
class Timeout(Setting):
    name = "timeout"
    section = "Worker Processes"
    cli = ["-t", "--timeout"]
    meta = "INT"
    validator = validate_pos_int
    type = int
    default = 0
    desc = """\
gunicorn_config.py
bind = "0.0.0.0:8080"
workers = 2
timeout = 300
Procfile
web: gunicorn app:app --timeout 300
But I get the same error in the DigitalOcean console, and on the browser frontend I get a 504 error after about 30 seconds, which means I'm not overriding Gunicorn's 30-second default.
Smaller/faster requests work fine.
Any idea what the issue could be? Thanks so much!
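One sanity check is to confirm that Gunicorn is actually loading the file that contains timeout = 300, by naming it explicitly on the command line. A minimal sketch, with paths and values as illustrative assumptions:
# gunicorn_config.py -- minimal sketch; start Gunicorn with the config named explicitly,
# e.g. in the Procfile: web: gunicorn app:app -c gunicorn_config.py
bind = "0.0.0.0:8080"
workers = 2
timeout = 300           # seconds a worker may spend on one request before being killed
graceful_timeout = 300  # grace period for workers on restart/reload
If the 504 still arrives after roughly 30 seconds with this in place, the cutoff may be enforced by a proxy or platform gateway in front of Gunicorn rather than by Gunicorn itself.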

Why does Istio's maxRequestsPerConnection affect HTTP/1.1 requests?

I'm just learning service mesh with Istio and I found some strange behavior.
To understand maxRequestsPerConnection in Istio's DestinationRule CRD, I wrote the manifest below and applied it.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpbin
spec:
  host: httpbin
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 1
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
And then, I sent requests using fortio. The result is below:
yunoMacBook-Air:labo8 yu$ kubectl exec "$FORTIO_POD" -c fortio -- /usr/bin/fortio load -c 5 -qps 0 -n 1000 -loglevel Error http://httpbin:8000/get
07:12:01 I logger.go:127> Log level is now 4 Error (was 2 Info)
Fortio 1.11.3 running at 0 queries per second, 2->2 procs, for 1000 calls: http://httpbin:8000/get
Aggregated Function Time : count 1000 avg 0.0036879818 +/- 0.004588 min 0.000379697 max 0.034176044 sum 3.68798183
# target 50% 0.00234783
# target 75% 0.0032551
# target 90% 0.008
# target 99% 0.025
# target 99.9% 0.032784
Sockets used: 876 (for perfect keepalive, would be 5)
Jitter: false
Code 200 : 126 (12.6 %)
Code 503 : 874 (87.4 %)
All done 1000 calls (plus 0 warmup) 3.688 ms avg, 1170.1 qps
yunoMacBook-Air:labo8 yu$
After that, I changed the maxRequestsPerConnection value to 10:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: httpbin
spec:
  (...)
        maxRequestsPerConnection: 10
and I sent requests again with the same settings.
yunoMacBook-Air:labo8 yu$ kubectl exec "$FORTIO_POD" -c fortio -- /usr/bin/fortio load -c 5 -qps 0 -n 1000 -loglevel Error http://httpbin:8000/get
07:11:07 I logger.go:127> Log level is now 4 Error (was 2 Info)
Fortio 1.11.3 running at 0 queries per second, 2->2 procs, for 1000 calls: http://httpbin:8000/get
Aggregated Function Time : count 1000 avg 0.0039736575 +/- 0.004068 min 0.000404827 max 0.030141552 sum 3.97365754
# target 50% 0.00231923
# target 75% 0.00475
# target 90% 0.0104667
# target 99% 0.0192
# target 99.9% 0.025
Sockets used: 723 (for perfect keepalive, would be 5)
Jitter: false
Code 200 : 281 (28.1 %)
Code 503 : 719 (71.9 %)
All done 1000 calls (plus 0 warmup) 3.974 ms avg, 1098.3 qps
yunoMacBook-Air:labo8 yu$
The rate of 200 responses increased, and I cannot understand why.
In my understanding, fortio uses HTTP/1.1, and with HTTP/1.1 only one HTTP request goes over each TCP connection, so I expected to get the same results.
Could you tell me why this happened?
First things first: HTTP/1.1 does allow multiple requests per connection; persistent (keep-alive) connections are the default behavior (RFC 2616, Section 8.1).
The documentation is a bit unclear.
maxRequestsPerConnection description states:
Maximum number of requests per connection to a backend. Setting this parameter to 1 disables keep alive. Default 0, meaning “unlimited”, up to 2^29.
Setting maxRequestsPerConnection to 1 disables Keep-Alive. Setting it to any other value (value > 1) switches Keep-Alive back on.
Setting this field to a proper value (not too high, not too low) is the hard part of configuring Istio, and it depends on your application's needs and traffic.
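To see the keep-alive behaviour described above outside of Istio, here is a small Python sketch using the requests library; the httpbin URL mirrors the question and is an assumption, any HTTP/1.1 server will do:
# Illustration of HTTP/1.1 persistent connections (keep-alive), independent of Istio.
# Assumes the `requests` package; substitute any reachable HTTP/1.1 endpoint for the URL.
import requests

url = "http://httpbin:8000/get"

# A Session reuses the underlying TCP connection between calls, so several requests
# travel over a single connection, which is the HTTP/1.1 default mentioned above.
with requests.Session() as session:
    for _ in range(5):
        print(session.get(url).status_code)
This is the same reuse that fortio reports: with keep-alive working, the 5 concurrent callers would need only 5 sockets for the 1000 requests.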

Session should expire after inactivity in Superset

I'm currently using Superset 0.28.1.
Sessions should expire after inactivity.
1) Tried setting the SUPERSET_WEBSERVER_TIMEOUT and SUPERSET_TIMEOUT parameters in config.py.
2) Tried the commands below while starting the server:
superset runserver -t 120
superset --timeout 60
gunicorn -w 2 --timeout 60 -b 0.0.0.0:9004 --limit-request-line 0 --limit-request-field_size 0 superset:app
Put the PERMANENT_SESSION_LIFETIME setting in your config file. The value is in seconds.
e.g.
PERMANENT_SESSION_LIFETIME = 60 means the session expires after 1 minute of inactivity.
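As a sketch, the relevant lines in the Superset/Flask config might look like the following; the 30-minute value is an illustrative assumption, and SESSION_REFRESH_EACH_REQUEST is a standard Flask setting (already True by default):
# config.py (or the superset_config.py overrides module)
from datetime import timedelta

# The session cookie becomes invalid after this much time.
PERMANENT_SESSION_LIFETIME = int(timedelta(minutes=30).total_seconds())  # 1800 seconds

# Flask refreshes the expiry on each request by default, so the session
# effectively expires after 30 minutes of inactivity.
SESSION_REFRESH_EACH_REQUEST = True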

Is it possible to avoid the 60-second limit in urllib2.urlopen on GAE?

I'm requesting a file of around 14 MB from a slow server with urllib2.urlopen. It takes more than 60 seconds to get the data, and I'm getting this error:
Deadline exceeded while waiting for HTTP response from URL:
http://bigfile.zip?type=CSV
Here is my code:
import gzip
import urllib2
import webapp2
from StringIO import StringIO

from google.appengine.api import taskqueue, urlfetch


class CronChargeBT(webapp2.RequestHandler):
    def get(self):
        taskqueue.add(queue_name='optimized-queue', url='/cronChargeBTB')


class CronChargeBTB(webapp2.RequestHandler):
    def post(self):
        url = "http://bigfile.zip?type=CSV"
        url_request = urllib2.Request(url)
        url_request.add_header('Accept-encoding', 'gzip')
        urlfetch.set_default_fetch_deadline(300)
        response = urllib2.urlopen(url_request, timeout=300)
        buf = StringIO(response.read())
        f = gzip.GzipFile(fileobj=buf)
        # ...work with the data inside the file...
I created a cron task that calls CronChargeBT. Here is the cron.yaml:
- description: cargar BlueTomato
  url: /cronChargeBT
  target: charge
  schedule: every wed,sun 01:00
It creates a new task and inserts it into a queue. Here is the queue configuration:
- name: optimized-queue
  rate: 40/s
  bucket_size: 60
  max_concurrent_requests: 10
  retry_parameters:
    task_retry_limit: 1
    min_backoff_seconds: 10
    max_backoff_seconds: 200
Of course the timeout=300 isn't working because of the 60-second limit in GAE, but I thought I could avoid it by using a task... Does anyone know how I can get the data in the file while avoiding this timeout?
Thanks a lot!!!
Cron jobs are limited to a 10-minute deadline, not 60 seconds. If your download fails, perhaps just retry? Does the download work if you download it from your computer? There's nothing you can do on GAE if the server you are downloading from is too slow or unstable.
Edit: According to https://cloud.google.com/appengine/docs/java/outbound-requests#request_timeouts, there is a maximum deadline of 60 seconds for cron job requests, so you can't get around it.

Tuning gunicorn (with Django): Optimize for more concurrent connections and faster connections

I use Django 1.5.3 with Gunicorn 18.0 and lighttpd. I serve my static and dynamic content like this using lighttpd:
$HTTP["host"] == "www.mydomain.com" {
$HTTP["url"] !~ "^/media/|^/static/|^/apple-touch-icon(.*)$|^/favicon(.*)$|^/robots\.txt$" {
proxy.balance = "hash"
proxy.server = ( "" => ("myserver" =>
( "host" => "127.0.0.1", "port" => 8013 )
))
}
$HTTP["url"] =~ "^/media|^/static|^/apple-touch-icon(.*)$|^/favicon(.*)$|^/robots\.txt$" {
alias.url = (
"/media/admin/" => "/var/www/virtualenvs/mydomain/lib/python2.7/site-packages/django/contrib/admin/static/admin/",
"/media" => "/var/www/mydomain/mydomain/media",
"/static" => "/var/www/mydomain/mydomain/static"
)
}
url.rewrite-once = (
"^/apple-touch-icon(.*)$" => "/media/img/apple-touch-icon$1",
"^/favicon(.*)$" => "/media/img/favicon$1",
"^/robots\.txt$" => "/media/robots.txt"
)
}
I have already tried running Gunicorn (via supervisord) in many different ways, but I can't get it to handle more than about 1100 concurrent connections. In my project I need about 10000-15000 connections:
command = /var/www/virtualenvs/myproject/bin/python /var/www/myproject/manage.py run_gunicorn -b 127.0.0.1:8013 -w 9 -k gevent --preload --settings=myproject.settings
command = /var/www/virtualenvs/myproject/bin/python /var/www/myproject/manage.py run_gunicorn -b 127.0.0.1:8013 -w 10 -k eventlet --worker_connections=1000 --settings=myproject.settings --max-requests=10000
command = /var/www/virtualenvs/myproject/bin/python /var/www/myproject/manage.py run_gunicorn -b 127.0.0.1:8013 -w 20 -k gevent --settings=myproject.settings --max-requests=1000
command = /var/www/virtualenvs/myproject/bin/python /var/www/myproject/manage.py run_gunicorn -b 127.0.0.1:8013 -w 40 --settings=myproject.settings
About 10 other projects live on the same server, but CPU and RAM are fine, so this shouldn't be a problem, right?
I ran a load test and these are the results:
At about 1100 connections my lighttpd error log says something like this (that's where the load test shows the drop in connections):
2013-10-31 14:06:51: (mod_proxy.c.853) write failed: Connection timed out 110
2013-10-31 14:06:51: (mod_proxy.c.939) proxy-server disabled: 127.0.0.1 8013 83
2013-10-31 14:06:51: (mod_proxy.c.1316) no proxy-handler found for: /
... after about one minute
2013-10-31 14:07:02: (mod_proxy.c.1361) proxy - re-enabled: 127.0.0.1 8013
These also appear every now and then:
2013-10-31 14:06:55: (network_linux_sendfile.c.94) writev failed: Connection timed out 600
2013-10-31 14:06:55: (mod_proxy.c.853) write failed: Connection timed out 110
...
2013-10-31 14:06:57: (mod_proxy.c.828) establishing connection failed: Connection timed out
2013-10-31 14:06:57: (mod_proxy.c.939) proxy-server disabled: 127.0.0.1 8013 45
So how can I tune gunicorn/lighttpd to serve more connections faster? What can I optimize? Do you know any other/better setup?
Thanks a lot in advance for your help!
Update: Some more server info
root@django ~ # top
top - 15:28:38 up 100 days, 9:56, 1 user, load average: 0.11, 0.37, 0.76
Tasks: 352 total, 1 running, 351 sleeping, 0 stopped, 0 zombie
Cpu(s): 33.0%us, 1.6%sy, 0.0%ni, 64.2%id, 0.4%wa, 0.0%hi, 0.7%si, 0.0%st
Mem: 32926156k total, 17815984k used, 15110172k free, 342096k buffers
Swap: 23067560k total, 0k used, 23067560k free, 4868036k cached
root@django ~ # iostat
Linux 2.6.32-5-amd64 (django.myserver.com) 10/31/2013 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
33.00 0.00 2.36 0.40 0.00 64.24
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 137.76 980.27 2109.21 119567783 257268738
sdb 24.23 983.53 2112.25 119965731 257639874
sdc 24.25 985.79 2110.14 120241256 257382998
md0 0.00 0.00 0.00 400 0
md1 0.00 0.00 0.00 284 6
md2 1051.93 38.93 4203.96 4748629 512773952
root@django ~ # netstat -an |grep :80 |wc -l
7129
Kernel Settings:
echo "10152 65535" > /proc/sys/net/ipv4/ip_local_port_range
sysctl -w fs.file-max=128000
sysctl -w net.ipv4.tcp_keepalive_time=300
sysctl -w net.core.somaxconn=250000
sysctl -w net.ipv4.tcp_max_syn_backlog=2500
sysctl -w net.core.netdev_max_backlog=2500
ulimit -n 10240
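As a starting point for further experiments, here is a sketch of a gevent-based Gunicorn config aimed at holding many concurrent connections; every number is an assumption to validate with the load test, and the gevent worker only pays off when the views are mostly I/O-bound:
# Hypothetical gunicorn_conf.py for the 127.0.0.1:8013 backend;
# launched with something like: gunicorn -c gunicorn_conf.py myproject.wsgi:application
import multiprocessing

bind = '127.0.0.1:8013'
backlog = 2048

worker_class = 'gevent'                         # async workers for many mostly-idle connections
workers = multiprocessing.cpu_count() * 2 + 1   # rule of thumb; the box sits at ~33% CPU
worker_connections = 2000                       # connections each worker may hold open

max_requests = 10000                            # recycle workers to keep memory in check
max_requests_jitter = 500                       # avoid all workers restarting at once
timeout = 60
keepalive = 5
If lighttpd still disables the proxy backend around 1100 connections with a setup like this, the limit may sit on the proxy side (its connection or file-descriptor limits) rather than in Gunicorn.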