My question is about what cluster.CircuitBreakers.Thresholds.max_connections really means in Envoy.
The Envoy documentation explains cluster.CircuitBreakers.Thresholds.max_connections as:
The maximum number of connections that Envoy will make to the upstream cluster. If not specified, the default is 1024.
Istio uses Envoy as a sidecar. Recently we tried the circuit-breaking sample but always found more connections than we had configured.
So we ran another test, shown below.
We added two services to Istio:
echo client: 1 pod, downstream, sends HTTP requests to the echo server
echo server: 2 pods, upstream.
The service pods:
[root@k8s-master istio-1.0.3]# kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE
echoclient-84485fbc5c-zxlv8 2/2 Running 0 8s 10.244.2.79 node02 <none>
echoserver-5655768fb9-smsvb 2/2 Running 0 23h 10.244.2.65 node02 <none>
echoserver-5655768fb9-srsq2 2/2 Running 0 7h52m 10.244.2.73 node02 <none>
We configured a destination rule for the echo server with maxConnections set to 2; below is the corresponding cluster info in Envoy.
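For reference, a DestinationRule roughly like the sketch below would produce this cluster config; the rule name is an assumption, and only the host and maxConnections values are taken from the test above:
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: echoserver          # assumed name, not necessarily the exact manifest used
spec:
  host: echoserver.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 2   # the limit under test
EOF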
istioctl proxy-config output:
[root@k8s-master istio-1.0.3]# istioctl proxy-config clusters echoclient-84485fbc5c-zxlv8 --fqdn echoserver.default.svc.cluster.local -o json
[
  {
    "name": "outbound|8080||echoserver.default.svc.cluster.local",
    "type": "EDS",
    "edsClusterConfig": {
      "edsConfig": {
        "ads": {}
      },
      "serviceName": "outbound|8080||echoserver.default.svc.cluster.local"
    },
    "connectTimeout": "1.000s",
    "circuitBreakers": {
      "thresholds": [
        {
          "maxConnections": 2
        }
      ]
    }
  }
]
We then made concurrent requests (10 threads, 40 requests per thread) from the echo client to the echo server service.
Results:
[root@k8s-master istio-1.0.3]# kubectl exec -it echoclient-84485fbc5c-zxlv8 /bin/bash
Defaulting container name to echoclient.
Use 'kubectl describe pod/echoclient-84485fbc5c-zxlv8 -n default' to see all of the containers in this pod.
[root@echoclient-84485fbc5c-zxlv8 /]# /opt/jre/bin/java -cp /opt/echoclient-1.0-SNAPSHOT-jar-with-dependencies.jar hello.HttpSender "http://echoserver:8080/echo?name=peter" 10 40 0
using num threads: 10
Starting pool-1-thread-1 with numCalls=40 parallelSends=false delayBetweenCalls=0 url=http://echoserver:8080/echo?name=peter mixedRespTimes=false
Starting pool-1-thread-2 with numCalls=40 parallelSends=false delayBetweenCalls=0 url=http://echoserver:8080/echo?name=peter mixedRespTimes=false
Starting pool-1-thread-3 with numCalls=40 parallelSends=false delayBetweenCalls=0 url=http://echoserver:8080/echo?name=peter mixedRespTimes=false
Starting pool-1-thread-4 with numCalls=40 parallelSends=false delayBetweenCalls=0 url=http://echoserver:8080/echo?name=peter mixedRespTimes=false
Starting pool-1-thread-5 with numCalls=40 parallelSends=false delayBetweenCalls=0 url=http://echoserver:8080/echo?name=peter mixedRespTimes=false
Starting pool-1-thread-6 with numCalls=40 parallelSends=false delayBetweenCalls=0 url=http://echoserver:8080/echo?name=peter mixedRespTimes=false
Starting pool-1-thread-7 with numCalls=40 parallelSends=false delayBetweenCalls=0 url=http://echoserver:8080/echo?name=peter mixedRespTimes=false
Starting pool-1-thread-8 with numCalls=40 parallelSends=false delayBetweenCalls=0 url=http://echoserver:8080/echo?name=peter mixedRespTimes=false
Starting pool-1-thread-9 with numCalls=40 parallelSends=false delayBetweenCalls=0 url=http://echoserver:8080/echo?name=peter mixedRespTimes=false
Starting pool-1-thread-10 with numCalls=40 parallelSends=false delayBetweenCalls=0 url=http://echoserver:8080/echo?name=peter mixedRespTimes=false
pool-1-thread-7: successes=[40], failures=[0], duration=[481ms]
pool-1-thread-6: successes=[40], failures=[0], duration=[485ms]
pool-1-thread-4: successes=[40], failures=[0], duration=[504ms]
pool-1-thread-1: successes=[40], failures=[0], duration=[542ms]
pool-1-thread-9: successes=[40], failures=[0], duration=[626ms]
pool-1-thread-8: successes=[40], failures=[0], duration=[652ms]
pool-1-thread-2: successes=[40], failures=[0], duration=[684ms]
pool-1-thread-10: successes=[40], failures=[0], duration=[657ms]
pool-1-thread-5: successes=[40], failures=[0], duration=[678ms]
pool-1-thread-3: successes=[40], failures=[0], duration=[696ms]
Check the HTTP connections from the echo client to the echo server.
Connection info from netstat:
[root@echoclient-84485fbc5c-zxlv8 /]# netstat -ano | grep 8080 | grep ESTABLISHED
tcp 0 0 10.244.2.79:58074 10.244.2.65:8080 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.244.2.79:38076 10.244.2.73:8080 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.244.2.79:58088 10.244.2.65:8080 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.244.2.79:38080 10.244.2.73:8080 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.244.2.79:58056 10.244.2.65:8080 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.244.2.79:38094 10.244.2.73:8080 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.244.2.79:38110 10.244.2.73:8080 ESTABLISHED off (0.00/0/0)
tcp 0 0 10.244.2.79:58076 10.244.2.65:8080 ESTABLISHED off (0.00/0/0)
Connection info from the Envoy clusters endpoint:
[root@echoclient-84485fbc5c-zxlv8 /]# curl -s http://localhost:15000/clusters | grep echoserver
outbound|8080||echoserver.default.svc.cluster.local::default_priority::max_connections::2
outbound|8080||echoserver.default.svc.cluster.local::default_priority::max_pending_requests::1024
outbound|8080||echoserver.default.svc.cluster.local::default_priority::max_requests::1024
outbound|8080||echoserver.default.svc.cluster.local::default_priority::max_retries::3
outbound|8080||echoserver.default.svc.cluster.local::high_priority::max_connections::1024
outbound|8080||echoserver.default.svc.cluster.local::high_priority::max_pending_requests::1024
outbound|8080||echoserver.default.svc.cluster.local::high_priority::max_requests::1024
outbound|8080||echoserver.default.svc.cluster.local::high_priority::max_retries::3
outbound|8080||echoserver.default.svc.cluster.local::added_via_api::true
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.65:8080::cx_active::4
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.65:8080::cx_connect_fail::0
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.65:8080::cx_total::4
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.65:8080::rq_active::0
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.65:8080::rq_error::0
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.65:8080::rq_success::200
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.65:8080::rq_timeout::0
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.65:8080::rq_total::200
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.65:8080::health_flags::healthy
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.65:8080::weight::1
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.65:8080::region::
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.65:8080::zone::
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.65:8080::sub_zone::
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.65:8080::canary::false
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.65:8080::success_rate::-1
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.73:8080::cx_active::4
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.73:8080::cx_connect_fail::0
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.73:8080::cx_total::4
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.73:8080::rq_active::0
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.73:8080::rq_error::0
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.73:8080::rq_success::200
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.73:8080::rq_timeout::0
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.73:8080::rq_total::200
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.73:8080::health_flags::healthy
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.73:8080::weight::1
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.73:8080::region::
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.73:8080::zone::
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.73:8080::sub_zone::
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.73:8080::canary::false
outbound|8080||echoserver.default.svc.cluster.local::10.244.2.73:8080::success_rate::-1
We can see that there are 8 connections from echoclient to echoserver (10.244.2.65, 10.244.2.73), not the configured maxConnections of 2.
Why are there 8 connections rather than 2?
Are we misunderstanding Envoy's maxConnections?
As Garrett mentioned in the comments:
max_connections or max_requests refers to the number of connections each child of the PHP pool can take before it respawns, and can typically be found in /etc/php/{version}/fpm/pool.d/www.conf. max_children is based on the memory of the machine; some example tutorials that helped me understand this were Max Requests / Children.
I think you invoked the client with 1 thread (i.e., 1 HTTP connection), but it sends requests in parallel (in batches of 10 by default).
In Envoy, max_connections applies to HTTP/1 connections, and in your case you have just a single HTTP connection.
Try taking a look at max_requests, which potentially applies to parallel requests and has a closer relation to HTTP/2.
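For example, these HTTP-level limits correspond to the connectionPool.http fields of an Istio DestinationRule. The sketch below is illustrative only; the rule name and the values are assumptions, not the asker's actual configuration:
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: echoserver
spec:
  host: echoserver.default.svc.cluster.local
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 2            # TCP/HTTP1 connection limit
      http:
        http1MaxPendingRequests: 1   # requests queued while waiting for a connection
        http2MaxRequests: 2          # maps to Envoy's max_requests
        maxRequestsPerConnection: 1  # disable connection reuse
EOF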
Hope this helps!!
SSH to the node of the client pod, and exec into the istio-proxy container:
docker exec --privileged --user root -it <istio-proxy-container-id> bash
Use the following command to see the TCP connections Envoy makes to the upstream:
ss -pe | grep 8080 | grep envoy
I think Envoy can make up to max_connections upstream connections per Envoy worker. If you run Envoy with a concurrency of 1, you can verify that only two upstream connections are created:
$ envoy -c envoy.yaml --concurrency 1
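You can also cross-check this from the sidecar's admin endpoint used earlier in the question; the grep patterns below are just examples:
curl -s http://localhost:15000/clusters | grep echoserver | grep cx_active
curl -s http://localhost:15000/stats | grep echoserver | grep upstream_cx_active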
Related
I'm using RabbitMQ 3.8.2 with Erlang 22.2.7 and having a problem while consuming tasks. My configuration is django-celery-rabbitmq. While publishing messages to a queue, everything goes OK until the queue length reaches 1200 messages. After that point RabbitMQ starts to close the AMQP connection with the following errors:
...
2022-11-01 09:35:25.327 [info] <0.20608.9> accepting AMQP connection <0.20608.9> (185.121.83.107:60447 -> 185.121.83.116:5672)
2022-11-01 09:35:25.483 [info] <0.20608.9> connection <0.20608.9> (185.121.83.107:60447 -> 185.121.83.116:5672): user 'rabbit_admin' authenticated and granted access to vhost '/'
...
2022-11-01 09:36:59.129 [warning] <0.19994.9> closing AMQP connection <0.19994.9> (185.121.83.108:36149 -> 185.121.83.116:5672, vhost: '/', user: 'rabbit_admin'):
client unexpectedly closed TCP connection
...
[error] <0.11162.9> closing AMQP connection <0.11162.9> (185.121.83.108:57631 -> 185.121.83.116:5672):
{writer,send_failed,{error,enotconn}}
...
2022-11-01 09:35:48.256 [error] <0.20201.9> closing AMQP connection <0.20201.9> (185.121.83.108:50058 -> 185.121.83.116:5672):
{inet_error,enotconn}
...
Then the django-celery consumer disappears from the queue list, messages become "ready", and the Celery pods are unable to ack the message after the job finishes, with the following error:
ERROR: [2022-11-01 09:20:23] /usr/src/app/project/celery.py:114 handle_message Error while handling Rabbit task: [Errno 104] Connection reset by peer
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/amqp/connection.py", line 514, in channel
return self.channels[channel_id]
KeyError: None
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/src/app/project/celery.py", line 76, in handle_message
message.ack()
File "/usr/local/lib/python3.10/site-packages/kombu/message.py", line 125, in ack
self.channel.basic_ack(self.delivery_tag, multiple=multiple)
File "/usr/local/lib/python3.10/site-packages/amqp/channel.py", line 1407, in basic_ack
return self.send_method(
File "/usr/local/lib/python3.10/site-packages/amqp/abstract_channel.py", line 70, in send_method
conn.frame_writer(1, self.channel_id, sig, args, content)
File "/usr/local/lib/python3.10/site-packages/amqp/method_framing.py", line 186, in write_frame
write(buffer_store.view[:offset])
File "/usr/local/lib/python3.10/site-packages/amqp/transport.py", line 347, in write
self._write(s)
ConnectionResetError: [Errno 104] Connection reset by peer
I have noticed that the message size also affects this behavior. In the above case there are about 1000-1500 characters in each message. If I decrease it to 50 characters, the threshold at which RabbitMQ starts to close the AMQP connection shifts to 4000-5000 messages.
I suspect that the problem is a lack of resources for RabbitMQ, but I don't know how to find out what exactly is going wrong. If I run htop on the server, I see that the 2 available CPUs are never under high load (less than 20% each) and RAM usage is 400 MB out of 3840 MB. So nothing seems to be wrong. Is there any resource-checking command for RabbitMQ? The tasks do not take long to complete, about 10 seconds each.
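For reference, these are the stock RabbitMQ CLI commands I know of for checking resource usage (standard rabbitmqctl/rabbitmq-diagnostics commands in 3.8, not specific to my setup):
rabbitmqctl status                                       # node memory, file descriptors, alarms
rabbitmq-diagnostics memory_breakdown                    # memory usage per category
rabbitmqctl list_queues name messages consumers memory   # per-queue depth, consumers and memory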
Also, maybe there are some missing heartbeats from the client (I had that problem earlier, but not now; there are currently no error messages about it).
Also, if I run sudo journalctl --system | grep rabbitmq, I get the following output:
......
May 24 05:15:49 oms-git.omsystem sshd[809111]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=43.154.63.169 user=rabbitmq
May 24 05:15:51 oms-git.omsystem sshd[809111]: Failed password for rabbitmq from 43.154.63.169 port 37010 ssh2
May 24 05:15:51 oms-git.omsystem sshd[809111]: Disconnected from authenticating user rabbitmq 43.154.63.169 port 37010 [preauth]
May 24 16:12:32 oms-git.omsystem sudo[842182]: ad : TTY=pts/3 ; PWD=/var/log/rabbitmq ; USER=root ; COMMAND=/usr/bin/tail -f -n 1000 rabbit@XXX-git.log
......
Maybe there is another issue with the firewall, but I don't see any error messages about that in /var/log/rabbitmq/rabbit@XXX.log.
My Celery configuration on the client is:
CELERY_TASK_IGNORE_RESULT = True
CELERY_RESULT_BACKEND = 'django-db'
CELERY_CACHE_BACKEND = 'django-cache'
CELERY_SEND_EVENTS = False
CELERY_BROKER_POOL_LIMIT = 30
CELERY_BROKER_HEARTBEAT = 30
CELERY_BROKER_CONNECTION_TIMEOUT = 600
CELERY_PREFETCH_MULTIPLIER = 1
CELERY_SEND_EVENTS = False
CELERY_WORKER_CONCURRENCY = 1
CELERY_TASK_ACKS_LATE = True
Currently I'm running the pod using the following command:
celery -A project.celery worker -l info -f /var/log/celery/celery.log -Ofair
Also, I have tried using various arguments to limit prefetch or turn off heartbeats, but it didn't work:
celery -A project.celery worker -l info -f /var/log/celery/celery.log --without-heartbeat --without-gossip --without-mingle
celery -A project.celery worker -l info -f /var/log/celery/celery.log --prefetch-multiplier=1 --pool=solo --
I expect there to be no limitation on queue length, and every Celery pod in my Kubernetes cluster to consume and ack messages without errors.
I am trying to set up an HTCondor batch system, but when I run condor_status it only shows the master, on both the master and the worker nodes. They both show this:
Name OpSys Arch State Activity LoadAv Mem
[master ip] LINUX X86_64 Unclaimed Idle 0.000 973
Total Owner Claimed Unclaimed Matched Preempting Backfill Drain
X86_64/LINUX 1 0 0 1 0 0 0 0
Total 1 0 0 1 0 0 0 0
condor_restart on the master node works fine, but on the worker nodes it yields this error:
ERROR
SECMAN:2010:Received "DENIED" from server for user unauthenticated#unmapped using no authentication method, which may imply host-based security. Our address was '[ip address of master]', and server's address was '[ip address of worker]'. Check your ALLOW settings and IP protocols.
Here are the config files.
Master node:
CONDOR_HOST = [private ip of master]
DAEMON_LIST = COLLECTOR, MASTER, NEGOTIATOR, SCHEDD, STARTD
# to avoid user authentication
HOSTALLOW_READ = *
HOSTALLOW_WRITE = *
HOSTALLOW_ADMINISTRATOR = *
Worker node:
CONDOR_HOST = [private ip of master]
DAEMON_LIST = MASTER, STARTD
# to avoid user authentication
HOSTALLOW_READ = *
HOSTALLOW_WRITE = *
HOSTALLOW_ADMINISTRATOR = *
In the same security group I am allowing:
All TCP: ports 0 - 65535
All ICMP-IPv4: all
SSH on port 22
This is what it looks like (security group ending in '6'):
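For anyone debugging the same DENIED error, two standard HTCondor commands can help check the effective security settings; this is a sketch I am adding for reference, not something from the original post:
condor_config_val -dump ALLOW    # print the ALLOW_* / HOSTALLOW_* values the daemons actually see
condor_ping -verbose WRITE       # test whether this host is authorized at the WRITE level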
Apparently the issue was running condor_reconfig -full. I just reinstalled without doing that, used systemctl restart condor instead, and it worked. If someone wants to offer some insight into why that was, please do :)
Elastic Beanstalk is adding and removing instances one after the other. Googling around points to checking the "State transition message", which comes up as "Client.UserInitiatedShutdown: User initiated shutdown". https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/troubleshooting-launch.html#troubleshooting-launch-internal lists some possible reasons, but none of them apply. No one has touched any settings. Any ideas?
UPDATE: Did a bit more digging and found out that app deployment is failing. Relevant log errors are below:
eb-engine.log
2021/08/05 15:46:29.272215 [INFO] Executing instruction: PreBuildEbExtension
2021/08/05 15:46:29.272220 [INFO] Starting executing the config set Infra-EmbeddedPreBuild.
2021/08/05 15:46:29.272235 [INFO] Running command /bin/sh -c /opt/aws/bin/cfn-init -s arn:aws:cloudformation:us-east-1:345470085661:stack/awseb-e-mecfm5qc8z-stack/317924c0-a106-11ea-a8a3-12498e67507f -r AWSEBAutoScalingGroup --region us-east-1 --configsets Infra-EmbeddedPreBuild
2021/08/05 15:50:44.538818 [ERROR] An error occurred during execution of command [app-deploy] - [PreBuildEbExtension]. Stop running the command. Error: EbExtension build failed. Please refer to /var/log/cfn-init.log for more details.
2021/08/05 15:50:44.540438 [INFO] Executing cleanup logic
2021/08/05 15:50:44.581445 [INFO] CommandService Response: {"status":"FAILURE","api_version":"1.0","results":[{"status":"FAILURE","msg":"Engine execution has encountered an error.","returncode":1,"events":[{"msg":"Instance deployment failed. For details, see 'eb-engine.log'.","timestamp":1628178644,"severity":"ERROR"}]}]}
2021/08/05 15:50:44.620394 [INFO] Platform Engine finished execution on command: app-deploy
2021/08/05 15:51:22.196186 [ERROR] An error occurred during execution of command [self-startup] - [PreBuildEbExtension]. Stop running the command. Error: EbExtension build failed. Please refer to /var/log/cfn-init.log for more details.
2021/08/05 15:51:22.196215 [INFO] Executing cleanup logic
eb-cfn-init.log
[2021-08-05T15:42:44.199Z] Completed executing cfn_init.
[2021-08-05T15:42:44.226Z] finished _OnInstanceReboot
+ RESULT=1
+ [[ 1 -ne 0 ]]
+ sleep_delay
+ (( 2 < 3600 ))
+ echo Sleeping 2
Sleeping 2
+ sleep 2
+ SLEEP_TIME=4
+ true
+ curl https://elasticbeanstalk-platform-assets-us-east-1.s3.amazonaws.com/stalks/eb_php74_amazon_linux_2_1.0.1153.0_20210728213922/lib/UserDataScript.sh
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 4627 100 4627 0 0 24098 0 --:--:-- --:--:-- --:--:-- 24098
+ RESULT=0
+ [[ 0 -ne 0 ]]
+ SLEEP_TIME=2
+ /bin/bash /tmp/ebbootstrap.sh 'https://cloudformation-waitcondition-us-east-1.s3.amazonaws.com/arn%3Aaws%3Acloudformation%3Aus-east-1%3A345470085661%3Astack/awseb-e-mecfm5qc8z-stack/317924c0-a106-11ea-a8a3-12498e67507f/AWSEBInstanceLaunchWaitHandle?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20200528T171102Z&X-Amz-SignedHeaders=host&X-Amz-Expires=86399&X-Amz-Credential=AKIAIIT3CWAIMJYUTISA%2F20200528%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=57c7da0aec730af1b425d1aff68517c333cf9d5432c984d775419b415cac8513' arn:aws:cloudformation:us-east-1:345470085661:stack/awseb-e-mecfm5qc8z-stack/317924c0-a106-11ea-a8a3-12498e67507f 65c52bb7-0376-4d43-b304-b64890a34c1c https://elasticbeanstalk-health.us-east-1.amazonaws.com '' https://elasticbeanstalk-platform-assets-us-east-1.s3.amazonaws.com/stalks/eb_php74_amazon_linux_2_1.0.1153.0_20210728213922 us-east-1
[2021-08-05T15:46:07.683Z] Started EB Bootstrapping Script.
[2021-08-05T15:46:07.739Z] Received parameters:
TARBALLS =
EB_GEMS =
SIGNAL_URL = https://cloudformation-waitcondition-us-east-1.s3.amazonaws.com/arn%3Aaws%3Acloudformation%3Aus-east-1%3A345470085661%3Astack/awseb-e-mecfm5qc8z-stack/317924c0-a106-11ea-a8a3-12498e67507f/AWSEBInstanceLaunchWaitHandle?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20200528T171102Z&X-Amz-SignedHeaders=host&X-Amz-Expires=86399&X-Amz-Credential=AKIAIIT3CWAIMJYUTISA%2F20200528%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Signature=57c7da0aec730af1b425d1aff68517c333cf9d5432c984d775419b415cac8513
STACK_ID = arn:aws:cloudformation:us-east-1:345470085661:stack/awseb-e-mecfm5qc8z-stack/317924c0-a106-11ea-a8a3-12498e67507f
REGION = us-east-1
GUID =
HEALTHD_GROUP_ID = 65c52bb7-0376-4d43-b304-b64890a34c1c
HEALTHD_ENDPOINT = https://elasticbeanstalk-health.us-east-1.amazonaws.com
PROXY_SERVER =
HEALTHD_PROXY_LOG_LOCATION =
PLATFORM_ASSETS_URL = https://elasticbeanstalk-platform-assets-us-east-1.s3.amazonaws.com/stalks/eb_php74_amazon_linux_2_1.0.1153.0_20210728213922
Is this some corrupted AMI?
It turned out that there was a config script in the .ebextensions directory that was misbehaving.
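If anyone hits the same failure, the misbehaving .ebextensions step should show up in the cfn-init logs that eb-engine.log points to; roughly (the exact log names depend on the platform version):
eb ssh                                       # or SSH to the instance directly
sudo tail -n 100 /var/log/cfn-init.log
sudo tail -n 100 /var/log/cfn-init-cmd.log   # per-command output, if present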
I am trying to daemonize my Celery/Redis workers on Ubuntu 18.04 and I am making progress! Celery is now running, but it does not appear to be communicating with my Django app. I found that after removing the Type=forking directive from the celery.service file, Celery started working.
# systemctl status celery.service
● celery.service - Celery Service
Loaded: loaded (/etc/systemd/system/celery.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2020-12-17 18:35:19 MST; 1min 52s ago
Main PID: 21509 (code=exited, status=1/FAILURE)
Tasks: 0 (limit: 4915)
CGroup: /system.slice/celery.service
Dec 17 18:35:17 t-rex systemd[1]: Starting Celery Service...
Dec 17 18:35:19 t-rex sh[24331]: celery multi v4.3.0 (rhubarb)
Dec 17 18:35:19 t-rex sh[24331]: > Starting nodes...
Dec 17 18:35:19 t-rex sh[24331]: > w1@t-rex: OK
Dec 17 18:35:19 t-rex sh[24331]: > w2@t-rex: OK
Dec 17 18:35:19 t-rex sh[24331]: > w3@t-rex: OK
Dec 17 18:35:19 t-rex systemd[1]: Started Celery Service.
When I test Celery from the Python prompt in my app's virtualenv, the test fails. This is the test I use in my app before I call a Celery task:
>>> celery_app.control.broadcast('ping', reply=True, limit=1)
[]
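For completeness, the equivalent checks from the command line (run inside the same virtualenv; the app name comes from my environment file below):
celery -A MemorabiliaJSON status        # lists online workers, if any
celery -A MemorabiliaJSON inspect ping  # pings workers over the broker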
My celery.service file (straight from the Celery docs) with a few local changes:
[Unit]
Description=Celery Service
After=network.target redis.service
Requires=redis.service
[Service]
#Type=forking
User=www-data
Group=www-data
EnvironmentFile=/etc/conf.d/celery
WorkingDirectory=/home/mark/python-projects/archive
ExecStart=/bin/sh -c '${CELERY_BIN} -A $CELERY_APP multi start $CELERYD_NODES \
--pidfile=${CELERYD_PID_FILE} --logfile=${CELERYD_LOG_FILE} \
--loglevel="${CELERYD_LOG_LEVEL}" $CELERYD_OPTS'
ExecStop=/bin/sh -c '${CELERY_BIN} multi stopwait $CELERYD_NODES \
--pidfile=${CELERYD_PID_FILE} --loglevel="${CELERYD_LOG_LEVEL}"'
ExecReload=/bin/sh -c '${CELERY_BIN} -A $CELERY_APP multi restart $CELERYD_NODES \
--pidfile=${CELERYD_PID_FILE} --logfile=${CELERYD_LOG_FILE} \
--loglevel="${CELERYD_LOG_LEVEL}" $CELERYD_OPTS'
Restart=always
[Install]
WantedBy=multi-user.target
and my environment file (also from the same celery docs):
# Name of nodes to start
# here we have a single node
CELERYD_NODES="w1 w2 w3"
# or we could have three nodes:
#CELERYD_NODES="w1 w2 w3"
# Absolute or relative path to the 'celery' command:
CELERY_BIN="/home/mark/.virtualenvs/archive/bin/celery"
#CELERY_BIN="/virtualenvs/def/bin/celery"
# App instance to use
# comment out this line if you don't use an app
CELERY_APP="MemorabiliaJSON"
# or fully qualified:
#CELERY_APP="proj.tasks:app"
# How to call manage.py
CELERYD_MULTI="multi"
# Extra command-line arguments to the worker
CELERYD_OPTS="--time-limit=300 --concurrency=8"
# - %n will be replaced with the first part of the nodename.
# - %I will be replaced with the current child process index
# and is important when using the prefork pool to avoid race conditions.
CELERYD_PID_FILE="/var/run/celery/%n.pid"
CELERYD_LOG_FILE="/var/log/celery/%n%I.log"
CELERYD_LOG_LEVEL="DEBUG"
The Redis server is running, so that should not be the issue. I am not sure whether Redis is talking to my daemonized Celery or not.
I start Celery with "celery -A MemorabiliaJSON worker -l debug" when using Django runserver, and I am not sure whether my daemonized Celery needs something else to make it talk to my Django app.
Is there any magic needed to get django/apache/wsgi to work with daemonized celery? There is nothing in the celery log files when I try my test above.
Thanks for any assistance you can give me in debugging this problem!
Mark
I use Django 1.5.3 with gunicorn 18.0 and lighttpd. I serve my static and dynamic content like this using lighttpd:
$HTTP["host"] == "www.mydomain.com" {
$HTTP["url"] !~ "^/media/|^/static/|^/apple-touch-icon(.*)$|^/favicon(.*)$|^/robots\.txt$" {
proxy.balance = "hash"
proxy.server = ( "" => ("myserver" =>
( "host" => "127.0.0.1", "port" => 8013 )
))
}
$HTTP["url"] =~ "^/media|^/static|^/apple-touch-icon(.*)$|^/favicon(.*)$|^/robots\.txt$" {
alias.url = (
"/media/admin/" => "/var/www/virtualenvs/mydomain/lib/python2.7/site-packages/django/contrib/admin/static/admin/",
"/media" => "/var/www/mydomain/mydomain/media",
"/static" => "/var/www/mydomain/mydomain/static"
)
}
url.rewrite-once = (
"^/apple-touch-icon(.*)$" => "/media/img/apple-touch-icon$1",
"^/favicon(.*)$" => "/media/img/favicon$1",
"^/robots\.txt$" => "/media/robots.txt"
)
}
I have already tried running gunicorn (via supervisord) in many different ways, but I can't get it optimized to handle more than about 1100 concurrent connections. In my project I need about 10000-15000 connections. These are the commands I tried; a plain-gunicorn equivalent is sketched after the list:
command = /var/www/virtualenvs/myproject/bin/python /var/www/myproject/manage.py run_gunicorn -b 127.0.0.1:8013 -w 9 -k gevent --preload --settings=myproject.settings
command = /var/www/virtualenvs/myproject/bin/python /var/www/myproject/manage.py run_gunicorn -b 127.0.0.1:8013 -w 10 -k eventlet --worker_connections=1000 --settings=myproject.settings --max-requests=10000
command = /var/www/virtualenvs/myproject/bin/python /var/www/myproject/manage.py run_gunicorn -b 127.0.0.1:8013 -w 20 -k gevent --settings=myproject.settings --max-requests=1000
command = /var/www/virtualenvs/myproject/bin/python /var/www/myproject/manage.py run_gunicorn -b 127.0.0.1:8013 -w 40 --settings=myproject.settings
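For reference, here is roughly what those attempts look like as a plain gunicorn invocation; myproject.wsgi:application is an assumed module path, and the flags are gunicorn's standard names:
gunicorn myproject.wsgi:application -b 127.0.0.1:8013 \
    -w 9 -k gevent --worker-connections 1000 \
    --max-requests 10000 --backlog 2048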
About 10 other projects live on the same server, but CPU and RAM are fine, so this shouldn't be a problem, right?
I ran a load test and these are the results:
At about 1100 connections my lighttpd error log says something like the following; that is where the load test shows the drop in connections:
2013-10-31 14:06:51: (mod_proxy.c.853) write failed: Connection timed out 110
2013-10-31 14:06:51: (mod_proxy.c.939) proxy-server disabled: 127.0.0.1 8013 83
2013-10-31 14:06:51: (mod_proxy.c.1316) no proxy-handler found for: /
... after about one minute
2013-10-31 14:07:02: (mod_proxy.c.1361) proxy - re-enabled: 127.0.0.1 8013
These things also appear every now and then:
2013-10-31 14:06:55: (network_linux_sendfile.c.94) writev failed: Connection timed out 600
2013-10-31 14:06:55: (mod_proxy.c.853) write failed: Connection timed out 110
...
2013-10-31 14:06:57: (mod_proxy.c.828) establishing connection failed: Connection timed out
2013-10-31 14:06:57: (mod_proxy.c.939) proxy-server disabled: 127.0.0.1 8013 45
So how can I tune gunicorn/lighttpd to serve more connections faster? What can I optimize? Do you know of any other/better setup?
Thanks a lot in advance for your help!
Update: Some more server info
root@django ~ # top
top - 15:28:38 up 100 days, 9:56, 1 user, load average: 0.11, 0.37, 0.76
Tasks: 352 total, 1 running, 351 sleeping, 0 stopped, 0 zombie
Cpu(s): 33.0%us, 1.6%sy, 0.0%ni, 64.2%id, 0.4%wa, 0.0%hi, 0.7%si, 0.0%st
Mem: 32926156k total, 17815984k used, 15110172k free, 342096k buffers
Swap: 23067560k total, 0k used, 23067560k free, 4868036k cached
root@django ~ # iostat
Linux 2.6.32-5-amd64 (django.myserver.com) 10/31/2013 _x86_64_ (4 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
33.00 0.00 2.36 0.40 0.00 64.24
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
sda 137.76 980.27 2109.21 119567783 257268738
sdb 24.23 983.53 2112.25 119965731 257639874
sdc 24.25 985.79 2110.14 120241256 257382998
md0 0.00 0.00 0.00 400 0
md1 0.00 0.00 0.00 284 6
md2 1051.93 38.93 4203.96 4748629 512773952
root@django ~ # netstat -an |grep :80 |wc -l
7129
Kernel Settings:
echo "10152 65535" > /proc/sys/net/ipv4/ip_local_port_range
sysctl -w fs.file-max=128000
sysctl -w net.ipv4.tcp_keepalive_time=300
sysctl -w net.core.somaxconn=250000
sysctl -w net.ipv4.tcp_max_syn_backlog=2500
sysctl -w net.core.netdev_max_backlog=2500
ulimit -n 10240