Filebeat load balancing not working - google-cloud-platform

I have set up a Logstash cluster in Google Cloud that sits behind a load balancer and uses autoscaling (when the load gets too high, new instances are started automatically).
Unfortunately, this does not work properly with Filebeat: Filebeat only hits those Logstash VMs that existed before Filebeat was started.
Example:
Let's assume I initially have these three Logstash hosts running:
Host1
Host2
Host3
When I start up Filebeat, it correctly distributes the messages to Host1, Host2 and Host3.
Now autoscaling kicks in and spins up 2 more instances, Host4 and Host5.
Unfortunately, Filebeat still only sends messages to Host1, Host2 and Host3. The new hosts, Host4 and Host5, are ignored.
When I now restart Filebeat, it sends messages to all 5 hosts!
So it seems Filebeat only sends messages to those hosts that were running when Filebeat started up.
My filebeat.yml looks like this:
filebeat.inputs:
- type: log
  paths:
    ...
...
output.logstash:
  hosts: ["logstash-loadbalancer:5044", "logstash-loadbalancer:5044"]
  worker: 1
  ttl: 2s
  loadbalance: true
I have added the same host (the load balancer) twice because I've read in the forums that otherwise Filebeat won't load-balance messages; I can confirm that.
But load balancing still does not seem to work properly; e.g. the TTL seems not to be respected, because Filebeat always targets the same connections.
Is my configuration wrong, or is this a bug in Filebeat?

I hope you have already resolved this problem. In case you haven't: you should set pipelining to 0, as below (ttl only takes effect if pipelining is set to 0):
output.logstash:
  hosts: ["logstash-loadbalancer:5044", "logstash-loadbalancer:5044"]
  worker: 1
  ttl: 2s
  loadbalance: true
  pipelining: 0
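For context, the Filebeat documentation notes that the ttl option is not supported when the Logstash output runs asynchronously (i.e. with pipelining enabled). Setting pipelining: 0 is what allows the 2s TTL to take effect, so Filebeat periodically drops its connections and reconnects through the load balancer, picking up newly scaled instances.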

Fargate task stops about 10s after it starts with no log output

My Fargate task keeps stopping about 10 seconds after it starts and doesn't output any logs (the awslogs driver is selected).
The container does start up and stay running when I run it with Docker locally.
Docker-compose file:
version: '2'
services:
  asterisk:
    build: .
    container_name: asterisk
    restart: always
    ports:
      - 10000-10099:10000-10099/udp
      - 5060:5060/udp
Dockerfile:
FROM debian:10.7
RUN {stuff-that-works-is-here}
# Keep Asterisk running in the foreground
ENTRYPOINT ["asterisk", "-f"]
# SIP port (EXPOSE takes only a port, not a host:container mapping)
EXPOSE 5060/udp
# RTP ports
EXPOSE 10000-10099/udp
My task execution role has full CloudWatch access for debugging.
Click on the ECS task instance and expand the container section; the error should be shown there.
The awslogs log driver alone is not enough.
Unfortunately, Fargate doesn't create the log group for you unless you tell it to.
See "Creating a log group" at https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_awslogs.html
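A minimal sketch of what that looks like in the task definition (the group name, region and stream prefix are placeholders; awslogs-create-group also requires the execution role to have the logs:CreateLogGroup permission):

"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group": "/ecs/asterisk",
    "awslogs-region": "us-east-1",
    "awslogs-stream-prefix": "asterisk",
    "awslogs-create-group": "true"
  }
}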
I had a similar problem, and the cause was the health check.
ECS doesn't have health checks for UDP, so when you open a UDP port and deploy with Docker (docker compose), it creates a health check pointing to a TCP port; since there were no open TCP ports in that range, the container kept resetting itself due to the failing health check.
I had to add a custom resource to the docker-compose file:
x-aws-cloudformation:
  Resources:
    AsteriskUDP5060TargetGroup:
      Type: "AWS::ElasticLoadBalancingV2::TargetGroup"
      Properties:
        HealthCheckProtocol: TCP
        HealthCheckPort: 8088
Basically, I have a health check for a UDP port pointing to a TCP port. It's a "hack" to bypass this problem when the deployment is done with Docker; a sketch of one way to provide the TCP listener follows.
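For that TCP health check to pass, something in the container must actually listen on port 8088. One hedged option, assuming you are willing to enable Asterisk's built-in mini-HTTP server for this purpose, is:

; /etc/asterisk/http.conf -- enable the built-in HTTP server so the
; TCP health check on port 8088 has something to connect to
[general]
enabled=yes
bindaddr=0.0.0.0
bindport=8088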

AWS ECS service Tasks getting replaced with (reason Request timed out)

We have been running ECS as our container orchestration layer for more than 2 years, but there is one problem we have not been able to figure out the reason for. In a few of our (Node.js) services we have started observing errors in ECS events such as:
service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)
This causes our dependent services to start experiencing 504 Gateway Timeout errors, which impacts them in a big way. Things we have already tried:
1. Upgraded the Docker storage driver from devicemapper to overlay2
2. Increased the resources for all ECS instances, including CPU, RAM and EBS storage
3. Increased the health check grace period for the service from 0 to 240 seconds
4. Increased KeepAliveTimeout and SocketTimeout to 180 seconds
5. Enabled awslogs on containers instead of stdout, but there was no unusual behavior
6. Enabled ECS metadata at the container level and piped all of that information into our application logs; this helped us look at the logs for the problematic container only
7. Enabled Container Insights for better container-level debugging
Of these, the things that helped the most were upgrading from devicemapper to the overlay2 storage driver and increasing the health check grace period.
The number of errors has come down dramatically with these two changes, but we are still getting this issue once in a while.
We have looked at all the graphs for the instance and container that went down; below are the logs for it.
ECS Container Insights logs for the victim container:
Query:
fields CpuUtilized, MemoryUtilized, @message
| filter Type = "Container" and EC2InstanceId = "i-016b0a460d9974567" and TaskId = "dac7a872-5536-482f-a2f8-d2234f9db6df"
Example log returned:
{
  "Version": "0",
  "Type": "Container",
  "ContainerName": "example-service",
  "TaskId": "dac7a872-5536-482f-a2f8-d2234f9db6df",
  "TaskDefinitionFamily": "example-service",
  "TaskDefinitionRevision": "2048",
  "ContainerInstanceId": "74306e00-e32a-4287-a201-72084d3364f6",
  "EC2InstanceId": "i-016b0a460d9974567",
  "ServiceName": "example-service",
  "ClusterName": "example-service-cluster",
  "Timestamp": 1569227760000,
  "CpuUtilized": 1024.144923245614,
  "CpuReserved": 1347.0,
  "MemoryUtilized": 871,
  "MemoryReserved": 1857,
  "StorageReadBytes": 0,
  "StorageWriteBytes": 577536,
  "NetworkRxBytes": 14441583,
  "NetworkRxDropped": 0,
  "NetworkRxErrors": 0,
  "NetworkRxPackets": 17324,
  "NetworkTxBytes": 6136916,
  "NetworkTxDropped": 0,
  "NetworkTxErrors": 0,
  "NetworkTxPackets": 16989
}
None of the logs showed CPU or memory utilization that was ridiculously high.
We stopped getting responses from the victim container at, let's say, t1; we got errors in dependent services at t1+2min, and the container was taken away by ECS at t1+3min.
Our health check configuration is below:
Protocol             HTTP
Path                 /healthcheck
Port                 traffic port
Healthy threshold    10
Unhealthy threshold  2
Timeout              5 seconds
Interval             10 seconds
Success codes        200
Let me know if you need any more information; I will be happy to provide it. The configuration we are running is:
docker info
Containers: 11
 Running: 11
 Paused: 0
 Stopped: 0
Images: 6
Server Version: 18.06.1-ce
Storage Driver: overlay2
 Backing Filesystem: xfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.14.138-89.102.amzn1.x86_64
Operating System: Amazon Linux AMI 2018.03
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 30.41GiB
Name: ip-172-32-6-105
ID: IV65:3LKL:JESM:UFA4:X5RZ:M4NZ:O3BY:IZ2T:UDFW:XCGW:55PW:D7JH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
There should be some indication of resource contention, a service crash, or a genuine network failure to explain all this, but as mentioned, we found nothing that pointed to the cause.
Your steps 1 to 7 have almost nothing to do with the error.
service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)
The error is very clear: your ECS service is not reachable by the load balancer's health check.
Target Group Unhealthy
When this is the case, go straight to checking the container security group, port, application status, and health check status code.
Possible reasons:
There is no route for the path /healthcheck in the backend service
The status code returned by /healthcheck is not 200
The target port is invalid; configure it correctly so that, if the application is running on port 8080 or 3000, the target port is 8080 or 3000
The security group is not allowing traffic to the target group
The application is not running in the container
These are the possible reasons for a timeout from the health check; the quick checks sketched below can rule out most of them.
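A hedged set of sanity checks, run from the container instance itself (port 1047 is taken from the error above, so adjust it to your task's host port; the service name in the filter is a placeholder):

# Does the app answer the health check locally? Expect an HTTP 200.
curl -v http://localhost:1047/healthcheck

# Is the container actually running?
docker ps --filter "name=example-service"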
I faced the same issue (reason: Request timed out).
I managed to solve it by updating my security group inbound rules.
There was no rule defined in the inbound rules, so I added a general allow-all IPv4 traffic rule for the time being, because I was still in development at that point.
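Allow-all is fine for development, but a tighter rule is to let only the load balancer's security group reach the instances' ephemeral port range. A sketch with placeholder IDs (sg-0123... is the container instances' security group, sg-0fed... is the load balancer's):

# Placeholder IDs; the port range covers ECS dynamic host ports.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 1024-65535 \
  --source-group sg-0fedcba9876543210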

scale some (but not all) tasks in ECS cluster

I have a very simple docker-compose file for Locust. It consists of one master (which is basically a web server for a client) and one slave (a client that actually performs the load testing, which is what Locust is for).
version: "3"
services:
locust-master:
image: chapkovski/locust
ports:
- "80:80"
environment:
LOCUST_MODE: master
locust-slave:
image: chapkovski/locust
"links": [
"locust-master"
]
environment:
LOCUST_MODE: slave
LOCUST_MASTER_HOST: locust-master
LOCUST_MASTER_PORT: 5557
Everything works on AWS ECS. But now I would like to have multiple slaves connected to the same master, and I can't figure out how to do this: when I try to scale up the tasks, it results in an error because the ports are already busy. That is obvious, since scaling up this task definition makes the ECS agent run several masters on the same port.
When I try to split the master and slave into two tasks, so that I can scale up only the 'slave' one, then of course they cannot communicate, and the master does not see any clients.
So what is the correct way to scale up only the 'client' part if, let's say, I need 20 clients and one master?
You cannot scale services with a predefined host port; if you do, you will get the error that ports are already busy.
You have two options to resolve this issue:
One service per EC2 instance (not great, but a workaround)
Dynamic port binding
With the second option, the ECS agent assigns a dynamic host port that does not conflict with any occupied port, so you can scale to as many tasks as you want.
You need to set the host port to 0 in the port mapping section, as shown below.
understanding-dynamic-port-mapping-in-amazon-ecs-with-application-load-balancer
"portMappings": [
{
"containerPort": 3000,
"hostPort": 0
}
]
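For comparison, a hedged docker-compose analogue (the service and image names are placeholders, and container port 3000 is just the same illustrative value as above) is to publish only the container port and let the engine pick a free host port:

services:
  my-service:            # placeholder service name
    image: my-image      # placeholder image
    ports:
      - "3000"           # container port only; a free host port is assigned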

AWS instance can't be accessed from browser

I set up a Kubernetes cluster and then deployed it on AWS. It created one load balancer, one master and 4 minion nodes.
I can use the kubectl proxy command to check whether it works locally, and it turned out that it does: I am able to connect to a particular pod.
The problem is that I can't access it externally. I have an ELB address which looks like this:
ab0154f2bcc5c11e6aff30a71ada8ce9-447509613.eu-west-1.elb.amazonaws.com
I also modified the security groups, so each node has the following security group:
Ports  Protocol  Source
80     tcp       0.0.0.0/0
8080   tcp       0.0.0.0/0
All    All       sg-4dbbce2b, sg-4ebbce28, sg-e6a4d180
22     tcp       0.0.0.0/0
What might be wrong with this configuration?
Does the service which created the ELB have endpoints? Do a kubectl describe svc <serviceName> and check the Endpoints section. If it is empty, you need to match up the selectors better. If you do see endpoints, I would try hitting the NodePort from one of the machines to verify it works; a simple curl should do. If that works, then I would look deeper into the AWS security groups. A sketch of both checks follows.
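A hedged sketch of those two checks (the service name and node address are placeholders):

# 1) Does the service have endpoints backing it?
kubectl describe svc my-service
#    Look for a non-empty line such as:
#    Endpoints: 10.244.1.5:80,10.244.2.7:80

# 2) Find the NodePort and hit it from one of the nodes:
kubectl get svc my-service -o jsonpath='{.spec.ports[0].nodePort}'
curl -v http://<node-private-ip>:<node-port>/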

How to make a rolling upgrade with ansible in a different AWS VPC subnet?

We are running a continuous integration server (Jenkins) in a public subnet of our AWS VPC, and we'd like to trigger an upgrade as a post-build task when a commit is made to master...
The easiest way would be to let Ansible SSH into the machines, pull the latest master, restart the service, and then proceed to the next one; but since the CI host is running in a different subnet, we cannot reach the servers.
Our autoscaling configuration's user data script fetches the HEAD of the repository automatically upon start, so all we'd need to do is terminate all the existing instances in the ELB and let autoscaling bring up the new ones.
The problem is that I don't know how to specify in the playbook that it should wait until a new instance is up and running before terminating the next one.
Another option that would work is to bring up all the new instances at once, and when they are up and running, detach all the old ones from the ELB and terminate them (but I cannot find examples of how to do that either!).
Another option is to create a new autoscaling group with a new ELB and all new instances. When they're launched, you can run tests against the new ELB, confirm everything is working, then swing DNS over. The gotcha with this approach is ELB pre-warming: if your site is busy, the new ELB will not handle the traffic unless you ask Amazon to "prewarm" it for you.
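A hedged sketch of the DNS swing using Route 53 (the zone ID, record name, and ELB hostname are all placeholders):

aws route53 change-resource-record-sets \
  --hosted-zone-id Z1234567890ABC \
  --change-batch file://swing-dns.json

where swing-dns.json points the record at the new ELB:

{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "www.example.com.",
      "Type": "CNAME",
      "TTL": 60,
      "ResourceRecords": [{ "Value": "new-elb-123456.eu-west-1.elb.amazonaws.com" }]
    }
  }]
}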
If you want to check whether a server has started or shut down, you can do this from your Jenkins server, although it needs to be able to reach your instances inside your VPC.
---
- name: Deployment playbook
  hosts: all
  tasks:
    # Using port 2222 here, but you can use port 80 or 443
    - name: Wait for new deployment instance to come up
      local_action: wait_for host=your_new_deployment_host port=2222
    - name: Shutdown or terminate old servers
      local_action: command ec2-terminate-instances <your-old-server-instance-id>
To get around the VPC issue, you can probably set up an SSH tunnel from your server at startup to your Jenkins server (public IP) like this:
ssh -f -N -R2222:localhost:22 jenkins@yourjenkinsserver.com -S /tmp/control-socket
and if you want to check for HTTP in Jenkins on port 8888:
ssh -f -N -R8888:localhost:80 jenkins@yourjenkinsserver.com -S /tmp/control-socket-http
The trick here is that you need to bring up the new servers sequentially, because you are always using the same port 8888 on the Jenkins server to check. You also need to terminate the tunnels after you've done the check. From your Jenkins server:
- name: Terminate tunnel HTTP
  local_action: shell ssh -p 2222 localhost 'ssh -S /tmp/control-socket-http -O exit yourjenkinsserver.com'
- name: Terminate tunnel SSH
  local_action: shell ssh -p 2222 localhost 'ssh -S /tmp/control-socket -O exit yourjenkinsserver.com'
A lot of moving pieces, but it should do the trick.