Google firewall being saturated and not allowing valid requests through - google-cloud-platform

We've configured some firewall rules to block some bad IPs. This was done in the VPC Network -> Firewall area, NOT on the server via iptables or anything similar.
Everything is fine until we get floods of traffic from these bad IPs. I can see in the firewall log for this rule that it was blocking them, but there seems to be either a connection limit or a bandwidth limit. For 40 minutes I had a solid wall of hit counts at 24,000 for every minute: no ups or downs, just 24,000 constantly.
The server was getting no traffic and resource usage was way down. This was a problem because valid traffic was getting bottlenecked somewhere.
The only thing I can see in the docs is a limit of 130,000 maximum stateful connections.
https://cloud.google.com/vpc/docs/firewalls#specifications
Machine type is n1-standard-4
During this attack when I looked at the quotas page, nothing was maxed out.
Is anyone able to shed some light on this?

The answer is to resize the instance and add more cores.
Don't use instances with shared cores.
I went for an n2 with 8 cores and this has now resolved itself.
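As a sanity check on the flat 24,000 hits per minute reported in the question, here is a small sketch. It assumes (this is not stated in the original post) that GCP Firewall Rules Logging caps logged connections at 500 per vCPU per 5-second interval, counting at most 8 vCPUs. Treat those constants as assumptions to verify against the current documentation. If they hold, the flat line may reflect the logging cap for a 4-vCPU machine rather than the real volume of blocked traffic.

```python
# Assumed limits for Firewall Rules Logging (verify against current GCP docs).
LOGGED_CONNECTIONS_PER_VCPU = 500   # assumed cap per logging interval
INTERVAL_SECONDS = 5                # assumed logging interval
MAX_VCPUS_COUNTED = 8               # assumed ceiling on vCPUs that raise the cap


def max_logged_per_minute(vcpus: int) -> int:
    """Return the assumed maximum number of logged connections per minute."""
    per_interval = LOGGED_CONNECTIONS_PER_VCPU * min(vcpus, MAX_VCPUS_COUNTED)
    return per_interval * (60 // INTERVAL_SECONDS)


# n1-standard-4 has 4 vCPUs; the replacement n2 in the answer has 8.
print(max_logged_per_minute(4))
print(max_logged_per_minute(8))
```

Under these assumptions a 4-vCPU instance would top out at exactly the 24,000 entries per minute seen in the log, so the log alone may not show how much traffic was really arriving.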

Related

Ping between aws and gcp

I have created a site-to-site VPN connection between a Google Cloud Platform VPC and an AWS VPC, both in the North Virginia region. The problem is that I am getting very high ping times and low bandwidth when communicating between the instances. Can anyone tell me the reason for this?
image showing the ping data
The ping is very high considering the two regions are very close to each other. Please help.
There can be multiple causes:
1) Verify GCP network performance with gcping
2) Verify the TCP window size and RTT, which bound the bandwidth
3) Verify throughput with iperf or tcpdump
https://cloud.google.com/community/tutorials/network-throughput
Be aware that any VPN will be traversing the internet, so even though they are relatively close to each other there will be multiple hops before the instances are connected together.
Remember that traffic from the instance has to route out of the AWS network, across any hops on the internet to GCP, and finally to the target instance, and then back again to return the response.
In addition there is some variation in performance as the line will not be dedicated.
If you want dedicated performance, without traversing the internet you would need to look at AWS Direct Connect. However, this might limit your project because of cost.
One of the many limits on TCP throughput is:
Throughput <= EffectiveWindowSize / RoundTripTime
If your goal is indeed higher throughput, then you can consider tweaking the TCP window size limits. The default TCP window size under Linux is ~3MB. However, there is more to EffectiveWindowSize than that. There is also the congestion window, which depends on factors such as packet loss and the congestion control algorithm in use (e.g. cubic vs. bbr).
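To make the bound above concrete, here is a minimal Python sketch. The function names and the example RTT values are illustrative assumptions, not measurements from the question. The ~3 MB window is the figure quoted above, and the congestion window is ignored, so real throughput can be lower.

```python
def max_tcp_throughput_bps(window_bytes: float, rtt_seconds: float) -> float:
    """Upper bound on TCP throughput in bits per second: window / RTT."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(target_bps: float, rtt_seconds: float) -> float:
    """Window size in bytes needed to sustain target_bps at the given RTT."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return target_bps * rtt_seconds / 8


if __name__ == "__main__":
    window = 3 * 1024 * 1024  # ~3 MB window, as quoted in the answer
    for rtt_ms in (2, 20, 100):  # hypothetical round-trip times
        bound = max_tcp_throughput_bps(window, rtt_ms / 1000)
        print(f"RTT {rtt_ms} ms -> at most {bound / 1e6:.1f} Mbit/s")
```

Plugging in the RTT that ping reports over the VPN shows whether the observed bandwidth is already close to what a single TCP connection can achieve with that window.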
As far as sanity checking the ping RTTs you are seeing, you can compare with ping times you see between an instance in AWS us-east-1 and GCP us-east4 when you are not using a VPN.

How to know if an elb is handling a high load?

We are experiencing a high traffic load, and the great majority of requests are failing. We have added more instances behind our ELB, but that does not solve the issue. Our DB is at 50% CPU usage. Is there a way to test whether the ELB is the one that cannot handle the load?
ELB scales automatically with load, so it is almost never the bottleneck. You can start with the CloudWatch metrics, but it is also worth checking the VPC flow logs, not only the HTTP response codes from the ELB.
Also, ALB has an access log feature that needs to be turned on manually.
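As a rough first pass over the CloudWatch metrics mentioned above, the sketch below classifies where errors appear to originate. It assumes an Application Load Balancer, whose metrics in the AWS/ApplicationELB namespace include HTTPCode_ELB_5XX_Count, HTTPCode_Target_5XX_Count and RejectedConnectionCount (a Classic ELB uses different names). The sums are supplied by the caller, and the decision rule and the function name are illustrative simplifications, not an AWS-recommended procedure.

```python
from typing import Mapping


def triage_alb_errors(metric_sums: Mapping[str, float]) -> str:
    """Guess where 5xx errors originate from summed ALB CloudWatch metrics."""
    rejected = metric_sums.get("RejectedConnectionCount", 0)
    elb_5xx = metric_sums.get("HTTPCode_ELB_5XX_Count", 0)
    target_5xx = metric_sums.get("HTTPCode_Target_5XX_Count", 0)

    if rejected > 0:
        # The load balancer itself refused connections.
        return "load-balancer-rejecting"
    if elb_5xx == 0 and target_5xx == 0:
        return "no-5xx"
    if elb_5xx > target_5xx:
        # Generated by the load balancer, although slow or unreachable
        # targets can still be the underlying cause (e.g. 502/504).
        return "load-balancer-generated"
    return "target-generated"


# Hypothetical numbers for illustration only.
print(triage_alb_errors({"HTTPCode_Target_5XX_Count": 900,
                         "HTTPCode_ELB_5XX_Count": 12}))
```

If the result points at the targets, adding instances behind the load balancer will not help until the failing dependency further back is found.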

AWS ALB catastrophic failure

First, the background:
Yesterday our AWS-based business in US West 2, consisting of two auto-scale groups (and various other components like RDS further back) behind an ALB went offline for six hours. Service was only reinstated by building an entirely new ALB (migrating over the rules and target groups).
At 4:15am our local time (GMT+10) the ALB ceased to receive inbound traffic and would not respond to web traffic. We used it for port 80 and port 443 (with SSL cert) termination. At the same time, all target group instances were also marked as "Unhealthy" (although they most certainly were operable) and no traffic was forwarded on to them. DNS resolved correctly to the ALB. It simply stopped responding. Equivalent symptoms to a network router/switch being either switched off or firewalled out of existence.
Our other EC2 servers that were not behind the ALB continued to operate.
Initial thoughts were:
a) deliberate isolation by AWS? Bill not paid, some offence taken at an abuse report? Unlikely and AWS had not notified us of any transgression or reason to take action.
b) A mistake on our part in network configuration? No change had been made in days to NACL or security groups. Further we were sound asleep when it happened, nobody was fiddling with settings. When we built the replacement ALB we used the same NACL and security groups without problem.
c) Maintenance activity gone wrong? This seems most likely. But AWS appeared not to detect the failure. And we didn't pick it up because we considered a complete, inexplicable, and undetected failure of an ALB as "unlikely". We will need to put in place some external healthchecks of our own. We have some based upon Nagios so can enable alarming. But this doesn't help if an ALB is unstable - it is not practical to keep having to build a new one if this reoccurs.
The biggest concern is that this happened suddenly and unexpectedly and that AWS did not detect it. Normally we are never worried about AWS network infrastructure as "it just works". Until now. There are no user-serviceable options for an ALB (e.g. restart/refresh).
And now my actual question:
Has anyone else ever seen something like this? If so, what can be done to get service back faster or prevent it in the first place? If this happened to you what did you do?
I'm going to close this off.
It happened again the following Sunday, and again this evening. Exact same symptoms. Restoration was initially achieved by creating a new ALB and migrating rules and target groups over. Curiously, the previous ALB was observed to be operational again but when we tried to reinstate it then it failed again.
Creating new ELBs is no longer a workable workaround, and we've switched to an AWS Business support plan to get direct help from AWS.
Our best hypothesis is this: AWS have changed something in their maintenance process and the ALB (which is really just a collection of EC2 instances with some AWS "proprietary code") is failing, but it's really just wild speculation.
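The external health check mentioned in the question could be sketched roughly as follows. This is a minimal probe using only the Python standard library, meant to run from outside AWS (for example as a Nagios plugin). The function name, the timeout and the example URL are assumptions for illustration.

```python
import http.client
import urllib.error
import urllib.request


def endpoint_is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Covers refused connections, timeouts, DNS failures and 4xx/5xx
        # responses (urlopen raises HTTPError for those).
        return False


# Example call (placeholder URL, replace with the ALB's public hostname):
# endpoint_is_up("https://example.com/health", timeout=5.0)
```

A probe like this only detects the outage sooner; it does not explain why the ALB stopped responding.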

Fixing AWS Application Load Balancer connection timeouts under small loads

I'm currently using an application load balancer to split traffic between 3 instances.
Testing the individual instances (connecting via IP), they are each able to handle about 200 req/second without any connection timeouts (the timeout being set at 5 seconds).
As such, I'd assume that load balancing over all 3 of them would scale this up to ~600req/second (there's no bottleneck further down the pipe to stop this).
However, when sending the exact same type of test requests to an application load balancer, the connections start to time out before I even hit 100req/second.
I have already eliminated the possibility of a slowdown due to HTTPS (by just sending the requests over HTTP), the instances themselves are healthy and not under heavy use, and the load balancer reports no "errors".
I've also configured IP stickiness for ~20 minutes to try and improve the situation but it hasn't helped one bit.
What could be the cause of this problem? I found no information about increasing the network capacity of a load balancer on AWS and no similar questions, so I'm bound to be doing something wrong, but I'm quite unsure what that something is.

AWS and ELB Network throughput limits

My site runs on AWS and uses ELB.
I regularly see 2K concurrent users, and during these times requests through my stack become slow and take a long time to get a response (30s-50s).
None of my servers or my database show significant load at these times.
This leads me to believe my issue could be related to ELB.
I have added some images of a busy day on my site, which shows graphs of my main ELB. Can you perhaps spot something that would give me insight into my problem?
Thanks!
UPDATE
The ELB in the screengrabs is my main ELB, forwarding to multiple Varnish cache servers. In my Varnish VCL I would send misses for a couple of URLs, but Varnish has a queuing behavior for such requests. What I ended up doing was setting a high TTL for these requests and returning hit_for_pass for them. This lets Varnish know in vcl_recv that these requests should be passed to the back-end immediately. Since doing this, the problem outlined above has been completely fixed.
Did you SSH into one of the servers? Maybe you are reaching a connection limit in Apache or whatever server you run. Also check the CloudWatch metrics of the EBS volumes attached to your instances; maybe they are causing an I/O bottleneck.