While doing performance testing in the Docker Swarm cluster, the Transactions Per Second (TPS) does not go beyond 400 and the response time gradually increases.
While doing performance testing against a single server, the TPS is around 200.
So with a 10-node cluster, it should at least go beyond 1500 TPS.
But the TPS is not going beyond 400. It seems the Leader is not able to handle more than 400 requests and distribute them to the other 9 nodes in the cluster.
Any information on this will be really helpful. Is there any configuration that needs to be done in the Swarm cluster that will increase the TPS?
The docker swarm details are provided below:
Docker version: 1.12.1
Swarm Structure:
- Leader (Manager): server1
- Other Managers: server2 and server3
- Workers: all other 7 servers/nodes
Application/Service endpoint:
http://server1:8080/Application/Service
The above endpoint has been shared with our clients, so it also acts as the load-balancing endpoint.
The application is a webservice deployed in Tomcat 8 using docker.
Swarm Cluster
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
0415czstge3vibxh55kwnywyd server6 Ready Active
2keplduit5vaycpwzv9419gh7 server4 Ready Active
2r5e2ye9jhgko29s5bv7rolxq server3 Ready Active Reachable
5btrbs5qkrlr50uiip6n0y260 server9 Ready Active
7aqpnf79tv7aj1j5gqsmqph7x server10 Ready Active
856fyn6rdv9ypfz8o2jdsuj7p server2 Ready Active Reachable
a1gcuucxuuupg9gleu9miz7uk server5 Ready Active
a2uyjjhh7phm3wei2e1ydsc4o server7 Ready Active
bm7ztqyrbt7noak6lerfmcs3j * server1 Ready Active Leader
dwto8iizy8li46b7u6v9e4qk1 server8 Ready Active
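A minimal sketch of what can be checked on a manager node to confirm that the service is actually replicated across the cluster (my-service is a placeholder for the real service name):

$ sudo docker service ls                    # how many replicas the service is running
$ sudo docker service ps my-service         # which node each replica landed on
$ sudo docker service scale my-service=10   # run one replica per node so the work is spread out

If only a few replicas are running, every request that lands on any node is forwarded to those few containers, which would cap the throughput regardless of how many nodes are in the cluster.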
Related
So I set up an AWS ECS cluster to run a Docker image of the Valhalla service almost as-is.
Issue: the target group does not seem to be able to check the cluster's health. It is as if the health request reaches the cluster, but the container does not "forward" the request to Valhalla.
Description:
I created a repository on AWS ECR and pushed a Docker image of gisops/valhalla with only the valhalla.json file changed.
Here is the valhalla configuration I used.
Note that I changed the default listening port from 8002 to 80.
I created an ECS Fargate cluster, and a service that uses this task definition to launch a container that runs Valhalla.
The service receives traffic from an application load balancer via port 80.
The target group is checking /status path on port 80.
With everything set up, the task starts, and the task logs show that Valhalla initializes and runs fine.
However, the target group is not able to get a health status: the request seems to time out.
If the request were reaching Valhalla, the task logs would at least show it (Valhalla logs every incoming request by default), but they don't.
Therefore Fargate kills the task (Task failed ELB health checks in (target-group {my-target-group-uri})), which shows that the health request was indeed reaching the cluster service.
I don't think the issue is with the Valhalla configuration, because I can run the same Docker image locally and it works perfectly, using:
docker run -dt -p 3000:80 -v /local/path/to/valhalla-files:/custom_files/ --name valhalla gisops/valhalla:latest
And then checking localhost:3000/status
Does anyone have an idea of what the issue could be?
I have already spent a lot of time on this and I'm out of ideas. Thanks for your help!
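For reference, a rough sketch of how the target health and direct reachability can be checked, assuming the AWS CLI is configured (the target group ARN and the task's private IP are placeholders):

# What the ALB reports for each registered target, including the failure reason
aws elbv2 describe-target-health --target-group-arn <target-group-arn>

# From an instance in the same VPC, hit the task's private IP directly on the container port
curl -v http://<task-private-ip>:80/status

If the direct curl also times out, the security group attached to the task (or the address Valhalla is listening on) is a more likely culprit than the target group itself.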
I'm unable to deploy the simplest docker-compose file to an Elastic Beanstalk environment configured with an Application Load Balancer for high availability.
This is the docker-compose file:
version: "3.9"
services:
  demo:
    image: nginxdemos/hello
    ports:
      - "80:80"
    restart: always
This is the ALB configuration:
EB Chain of events:
1. Creating CloudWatch alarms and log groups
2. Creating security groups:
   - For the load balancer: allow incoming traffic from the internet to my two listeners on ports 80/443
   - For the EC2 machines: allow incoming traffic to the process port from the first security group created
3. Creating auto scaling groups
4. Creating the Application Load Balancer
5. Creating the EC2 instance
Approx. 10 minutes after creating the EC2 instance (#5), I get the following log:
Environment health has transitioned from Pending to Severe. ELB processes are not healthy on all instances. Initialization in progress (running for 12 minutes). None of the instances are sending data. 50.0 % of the requests to the ELB are failing with HTTP 5xx. Insufficient request rate (2.0 requests/min) to determine application health (6 minutes ago). ELB health is failing or not available for all instances.
Looking at the Target Group, it indicates 0 healthy instances (based on the default health checks).
When I SSH into the instance, I see that the Docker service is not even started and my application is not running. So that explains why the instance is unhealthy.
However, what am I supposed to do differently? Based on my understanding, it looks like a bug in the flow initiated by Elastic Beanstalk, as the flow is waiting for the instances to be healthy before starting my application (otherwise, why wasn't the application started in the 10 minutes after the EC2 instance was created?).
It doesn't seem like an application issue, because the Docker service was not even started.
Appreciate your help.
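For what it's worth, a sketch of what can be checked over SSH on an Amazon Linux 2 Docker platform instance while it is stuck in this state (paths are the standard Elastic Beanstalk ones):

# Is the Docker daemon up, and has any container been started at all?
sudo systemctl status docker
sudo docker ps -a

# Elastic Beanstalk's own deployment log usually says why the app was not started
sudo tail -n 100 /var/log/eb-engine.log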
I tried to replicate your issue using your docker-compose.yml and the Docker running on 64bit Amazon Linux 2/3.4.12 platform. For the test I created a zip file containing only the docker-compose.yml.
Everything works as expected and no issues were found.
The only thing I can suggest is to double-check your files. Also, there is no reason to use 443 as you don't have HTTPS at all.
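Roughly, the test bundle was just the compose file zipped on its own, something like the following (app-bundle.zip is an arbitrary name):

# Zip only the compose file, with no parent folder, and upload it as the application version
zip app-bundle.zip docker-compose.yml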
I'm now using the AWS Kubernetes service (EKS). Deployments and LoadBalancer Services were used.
(2 nodes, 1 LoadBalancer Service, 1 Deployment, 1 Pod, 1 ReplicaSet were used.)
However, when I add a port to the Service, a huge number of connections are made to the port I opened.
The log looks like the below.
[17:12:21.843] Client Connected [/192.168.179.222:28607]
[17:12:21.843] Client Disconnected [/192.168.179.222:28607]
[17:12:21.864] Client Connected [/192.168.179.222:16888]
[17:12:21.864] Client Disconnected [/192.168.179.222:16888]
[17:12:21.870] Client Connected [/192.168.79.91:58902]
[17:12:21.870] Client Disconnected [/192.168.79.91:58902]
[17:12:22.000] Client Connected [/192.168.179.222:52060]
[17:12:22.000] Client Disconnected [/192.168.179.222:52060]
[17:12:23.118] Client Connected [/192.168.79.91:14650]
[17:12:23.119] Client Disconnected [/192.168.79.91:14650]
192.168.179.222 and 192.168.79.91 are my nodes' IPs, and the logs are from the pods.
I thought it was because of the AWS load balancer's health check, but the health check interval is 30 seconds, so that doesn't explain this volume.
Because there are so many of these logs, I cannot see my real transaction logs.
How can I get rid of those connections? What is the reason for those logs?
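For context, this is roughly how the Service and the ELB it created can be inspected, in case the health check settings matter here (my-service is a placeholder for the LoadBalancer Service name):

# Full Service spec: nodePorts, externalTrafficPolicy and any
# service.beta.kubernetes.io/aws-load-balancer-* health check annotations
kubectl get svc my-service -o yaml

# Hostname of the ELB that the Service provisioned
kubectl get svc my-service -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'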
--- add
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-192-168-179-222.ap-northeast-2.compute.internal Ready <none> 11d v1.16.12-eks-904af05 192.168.179.222 ########## Amazon Linux 2 4.14.181-142.260.amzn2.x86_64 docker://19.3.6
ip-192-168-79-91.ap-northeast-2.compute.internal Ready <none> 11d v1.16.12-eks-904af05 192.168.79.91 ########## Amazon Linux 2 4.14.181-142.260.amzn2.x86_64 docker://19.3.6
This is my node info. I'm fairly sure the IPs in the logs are the node IPs.
I have several processes in my pod, and every process is flooded with these connection logs.
What you are seeing is a result of scanners on the internet restlessly trying to find vulnerable applications.
To fix that and to have cleaner logs, you can:
Do IP whitelisting on the security group, so that only certain IPs can connect to your service (see the sketch after this list)
Install a WAF to filter scanners out
Also, you may want to have structured logs, where your legit logs have a certain format that can easily be spotted and filtered away from the garbage logs created by the scanners.
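A minimal sketch of the security group whitelisting mentioned above, using the AWS CLI; the security group ID, port and CIDR are placeholders for your own values:

# Allow only a known CIDR to reach the load balancer's listener port
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 80 \
  --cidr 203.0.113.0/24

# Remove the existing open-to-the-world rule if there is one
aws ec2 revoke-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 80 \
  --cidr 0.0.0.0/0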
We are trying to transition our AWS EC2 instance to an AWS ECS service running behind an Elastic Load Balancer. This EC2 instance is responsible for receiving a lot of network connections and saving the received data to an S3 bucket and SQL. The problem is that no matter how many tasks we scale up to, or how big they are, they still fail the container health check once they are used in production. The container health check we are using is this bash script:
#!/bin/bash
# Instead of pinging apache we can only check the service's status for incoming due to the high load apache deals with
cgi-fcgi -bind -connect localhost:9001 \
&& service apache2 status \
&& service cron status \
&& service ssh status \
&& service awslogs status \
&& service td-agent status \
|| exit 1
Currently I have 1 ECS task with a max capacity of 2. Each task has 4 vCPUs and 16 GB of RAM. Our original EC2 instance is of the m5.2xlarge type and has 32 GB of RAM and 8 vCPUs. The EC2 instance never reaches 100% CPU or memory, so the ECS tasks should be big enough to handle this network load.
Once we start sending requests to the ECS task, the response from the server becomes slower (requests taking 45 seconds), then the health check script eventually fails and ECS kills the task.
I have also tried using the apache2buddy script to tune the performance of the Apache web server, but it did not help as the script did not report any issues. Also, tailing syslog and the Apache error log is not helpful as there are no obvious errors logged. So I am trying to find more ways to troubleshoot this and find out why the EC2 instance has no problems with all these network requests but the Docker container running on ECS does.
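For anyone who wants to dig in, this is roughly how the service events and the stop reasons of the killed tasks can be pulled (cluster and service names are placeholders):

# Recent service events (deployments, health check failures, scaling actions)
aws ecs describe-services --cluster my-cluster --services my-service \
  --query 'services[0].events[:10]'

# Stop reason of recently killed tasks
aws ecs list-tasks --cluster my-cluster --desired-status STOPPED
aws ecs describe-tasks --cluster my-cluster --tasks <task-arn> \
  --query 'tasks[].stoppedReason'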
docker swarm - Leader(Manager) Node service endpoint and load balancing.
I have two queries related to web service application deployment in Docker Swarm.
We have exposed the service endpoint to our customers as:
http://server1:8080/Application/Service
where server1 is the Leader (Manager) in our Swarm cluster. This also acts as our load balancer link.
But what happens to the service endpoint when server1 (the Leader) goes down?
As per Swarm, one of the other two managers will be elected as the new Leader.
Let us assume that server2 becomes the Leader.
But does that mean that the previous server1 service endpoint will no longer work and it needs to be changed to:
http://server2:8080/Application/Service and we need to share this new URL with our consumers?
If the URL gets changed, then it is really very challenging. If not, how does the previous server1 endpoint keep working, even though server1 is down as the Leader?
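A rough way to see whether the endpoint is really tied to the Leader, assuming the service publishes port 8080 through the routing mesh, is to hit the same path on every node (host names as in the cluster listing below):

for host in server1 server2 server3 server4 server5; do
  curl -s -o /dev/null -w "%{http_code}  $host\n" "http://$host:8080/Application/Service"
done

If the routing mesh is in play, every node should answer regardless of which manager is currently the Leader, so the published URL does not have to change on failover.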
In our current 10-node Swarm cluster, even though the service is deployed on all 10 servers, the service only works via the Leader server's endpoint:
http://server1:8080/Application/Service
where server1 is the Leader (Manager) in our Swarm cluster. This acts as our load balancer link and is shared with consumers.
The individual endpoints on all the other servers don't work. Is this expected behavior in Swarm, that all traffic will go via the Leader endpoint? (See the sketch after the list below.)
Leader (Manager): server1 -> Works:
http://server1:8080/Application/Service
Other Managers: server2 and server3 -> Don't work:
http://server2:8080/Application/Service
http://server3:8080/Application/Service
Workers: All other 7 servers/nodes -> Don't work:
http://server4:8080/Application/Service
http://server5:8080/Application/Service
.....................
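With a replicated service and a published port, the expected Swarm behavior is that the ingress routing mesh answers on every node, not just the Leader. A minimal sketch of how such a service would be created and verified (my-service and my-tomcat-image are placeholder names):

# Publish the port through the routing mesh so every node accepts traffic on 8080
sudo docker service create --name my-service --replicas 10 \
  --publish 8080:8080 my-tomcat-image

# If 8080 shows up under "Endpoint" -> "Ports" here, the routing mesh is active
sudo docker service inspect my-service

If the containers were instead started with plain docker run on individual nodes, or the port was never published on the service, only the node that actually maps the port will answer, which would match the behavior described above.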
The docker swarm details are provided below:
Docker version: 1.12.1
Swarm Structure:
- Leader (Manager): server1
- Other Managers: server2 and server3
- Workers: All other 7 servers/nodes
Application/Service endpoint:
http://server1:8080/Application/Service
The above endpoint has been shared with our clients, so it acts as the load-balancing endpoint.
$ sudo docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
0415czstge3vibxh55kwnywyd server6 Ready Active
2keplduit5vaycpwzv9419gh7 server4 Ready Active
2r5e2ye9jhgko29s5bv7rolxq server3 Ready Active Reachable
5btrbs5qkrlr50uiip6n0y260 server9 Ready Active
7aqpnf79tv7aj1j5gqsmqph7x server10 Ready Active
856fyn6rdv9ypfz8o2jdsuj7p server2 Ready Active Reachable
a1gcuucxuuupg9gleu9miz7uk server5 Ready Active
a2uyjjhh7phm3wei2e1ydsc4o server7 Ready Active
bm7ztqyrbt7noak6lerfmcs3j * server1 Ready Active Leader
dwto8iizy8li46b7u6v9e4qk1 server8 Ready Active