Some Varnish Requests Getting Past my Block.vcl - amazon-web-services

I recently dealt with a botnet running a sub-domain brute-force/crawling script. It would run through the alphabet and numbers sequentially, which resulted in a minor nuisance and a small load increase for legitimate traffic.
For example, hitting a.domain, b.domain, ..., 9.domain, aa.domain, ..., a9.domain, and so on.
Obviously this is quite stupid, and fortunately it only originated from a few IPs at a time, and the website in question was behind multiple auto-scaling load balancers. The attacks were stopped by grabbing the X-Forwarded-For address from Varnish: detection was scripted around the subdomain attempts, and the offending IP was added to a remote blocklist, which was regularly refreshed and pulled into a Block.vcl on all Varnish servers. Voilà.
This worked well, detecting and taking care of things within a couple of minutes each time. However, I noticed that after an attacking IP was blocked, 99.9% of its traffic would stop, but the occasional request from the blocked IP would still manage to get through. Not enough to cause a fuss, but enough to raise the question: why? I don't understand how a request could still make it through Varnish when it should hit the reject-on-IP rule in my Block.vcl.
Is there some inherent limitation that might have come into play here and allowed a small number of requests through? Perhaps something related to available resources, or the sheer number of requests per second overwhelming Varnish ever so slightly?
Resource-wise the web servers seemed fine, so I'm unsure. Any ideas?
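The detection script itself isn't shown above; purely as an illustration, a minimal sketch of that kind of detector might look like the following (TypeScript on Node, with the log format, field positions, and flagging threshold all assumed rather than taken from the actual setup):

import { createInterface } from "readline";

const THRESHOLD = 20; // distinct probe subdomains before an IP is flagged (assumption)
const probePattern = /^[a-z0-9]{1,2}\./i; // a.domain, b.domain, ..., a9.domain
const seen = new Map<string, Set<string>>(); // client IP -> probe subdomains attempted

// Assumed input: one access-log line per request on stdin, with the
// X-Forwarded-For client IP as the first field and the Host header as the second.
const rl = createInterface({ input: process.stdin });
rl.on("line", (line) => {
  const [ip, host] = line.trim().split(/\s+/);
  if (!ip || !host || !probePattern.test(host)) return;
  const labels = seen.get(ip) ?? new Set<string>();
  labels.add(host.split(".")[0].toLowerCase());
  seen.set(ip, labels);
  if (labels.size === THRESHOLD) {
    // Emit the IP once; something else would push it to the remote blocklist
    // that gets rendered into Block.vcl on each Varnish server.
    console.log(ip);
  }
});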

Related

Intermittent 502 gateway errors with AWS ALB in front of ECS services running express / nginx

Background:
We are running a single-page application served via nginx, with a Node.js (v12.10) backend running Express. It runs as containers on ECS, and we currently run three t3a.medium container instances, with the API and web services each running 6 replicas across them. We use an ALB to handle our load balancing / routing of requests. We run three subnets across 3 AZs, with the load balancer associated with all three and the instances spread across the 3 AZs as well.
Problem:
We are trying to get to the root cause of some intermittent 502 errors that are appearing for both the front and back end. I have downloaded the ALB access logs, and the interesting thing is that these requests all show the following:
- request_processing_time: 0.000
- target_processing_time: 0.000 (sometimes this will be 0.001 or at most 0.004)
- response_processing_time: -1
At the time of these errors I can see that there were healthy targets available.
Now, I know that some people have had issues like this when keep-alive timeouts were shorter on the server side than on the ALB side, so connections were being forcibly closed and the ALB then tried to reuse them (which is in line with the AWS troubleshooting guidelines). However, looking at the keep-alive times for our back end, they are currently set higher than our ALB's, by double. Also, the requests themselves can be replayed via Chrome dev tools and they succeed (I'm not sure if this is a valid way to check a malformed request, but it seemed reasonable).
I am very new to this area, and if anyone has suggestions as to where to look or what sort of tests to run that might help me pinpoint this issue, it would be greatly appreciated. I have run some load tests on certain endpoints and reproduced 502 errors; however, the errors under heavy load differ from the intermittent ones in our logs in that the target_processing_time is quite high, so to my mind that is another issue altogether. At this stage I would like to start by understanding the errors that show a target_processing_time of essentially zero.
I wrote a blog post about this a bit over a year ago that's probably worth taking a look at (the problem is caused by a behavior change in Node.js 8+):
https://adamcrowder.net/posts/node-express-api-and-aws-alb-502/
The TL;DR is that you need to set the Node.js http.Server keepAliveTimeout (which is in milliseconds) to be higher than the load balancer's idle timeout (which is in seconds).
Please also note that there is something called HTTP keep-alive, which sets an HTTP header and has absolutely nothing to do with this problem. Make sure you're setting the right thing.
Also note that there is currently a regression in Node.js where setting keepAliveTimeout may not work properly. That bug is being tracked here: https://github.com/nodejs/node/issues/27363 and is worth looking through if you're still having this problem (you may also need to set headersTimeout).
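As a rough illustration of the fix (the numbers below are assumptions; match them to your own ALB idle timeout, which defaults to 60 seconds):

import express from "express";
import * as http from "http";

const app = express();
app.get("/health", (_req, res) => {
  res.send("ok");
});

const server = http.createServer(app);

// Keep Node's keep-alive window longer than the ALB idle timeout so the ALB,
// not the app, is the side that closes idle connections. 65/66 seconds are
// example values, not magic numbers.
server.keepAliveTimeout = 65 * 1000; // milliseconds
// Work-around for the regression linked above: keep headersTimeout above
// keepAliveTimeout, otherwise idle sockets can still be torn down early.
server.headersTimeout = 66 * 1000; // milliseconds

server.listen(3000);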

What happened with polyfill.io's CDN SSL certificate?

Certificate errors happen from time to time, but this looks very fishy to me. A certificate for all those names? What's going on? Did the CDN get hacked?
If so, what is the best thing to do? Remove it until it is fixed? That would be bad for a good part of our user base. A hacked CDN is worse, though, I guess… Maybe someone knows what is really going on?
I am one of the maintainers of polyfill.io, and a Fastly employee. Yesterday we enacted a change to our DNS configuration to enable support for HTTP/2.0. In doing so, a small typo was made in the hostname, resulting in our DNS targeting the wrong endpoint on Fastly's network, and a cert that was not valid for polyfill.io or cdn.polyfill.io. Having realised the error, we corrected the entry and it took around 30 minutes to propagate.
Lessons learnt include not increasing DNS TTL until some time after a change is made, in case the change needs to be rolled back.
The reason there are so many names listed on the cert is that we are sharing a cert with other Fastly customers. This is perfectly normal practice for CDN providers.
More information is available on the relevant GitHub issue:
https://github.com/Financial-Times/polyfill-service/issues/1208
We're very disappointed to suffer this downtime. Generally, polyfill.io has a very good uptime record, and we plan for origin outages. It's hard to mitigate the risks associated with DNS changes to the main public domain, but we are very sorry to everyone impacted.
Polyfill.io uses pingdom to independently monitor our uptime and reports that number here: https://polyfill.io/v2/docs/usage (data has up to 24 hrs latency).
Looks like "they" (see below) botched it, I can't see cdn.polyfill.io or *.polyfill.io on that large list, hence the error saying much the same.
(or maybe I overlooked some other problem)
To enlighten you about the names: virtual hosting (the act of hosting multiple websites on the same IP address on the same HTTP port) occurs over HTTPS /after/ the encryption is established. Thus, at the time the server presents a certificate to the browser, it doesn't know exactly which site the user is after, as that information is part of the encrypted request.
Thus, it is necessary for the certificate to cover all secure websites operating on that IP address and port combination.
CDN stands for Content Delivery Network; presumably a huge bunch of stuff is being hosted on this "network", probably not even owned by polyfill (I've no idea who they are). Given that the first name on the certificate is "f2.shared.global.fastly.net", you can speculate about the true CDN, who actually messed up the cert, and what else they're hosting on the CDN there :)
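If you want to check for yourself which names an edge server is actually presenting, one way is to open a TLS connection with SNI and dump the certificate's subject alternative names. A small sketch (the hostname is just the one from the question; verification is disabled purely so a mismatched cert can still be inspected):

import * as tls from "tls";

const host = "cdn.polyfill.io";
const socket = tls.connect(
  // rejectUnauthorized: false only so the cert can be printed even when it
  // does not match the hostname, as happened in the incident above.
  { host, port: 443, servername: host, rejectUnauthorized: false },
  () => {
    const cert = socket.getPeerCertificate();
    console.log("subject CN:     ", cert.subject?.CN);
    console.log("subjectAltName: ", cert.subjectaltname);
    socket.end();
  }
);
socket.on("error", (err) => console.error("TLS error:", err.message));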

How much overhead would a DNS call add to the response time of my API?

I am working on a cross-platform application that runs on iOS, Android, and the web. Currently there is an API server which interacts with all the clients. Each request to the API is made via the IP address (e.g. http://1.1.1.1/customers). This prevents me from quickly moving the backend to another cloud VPS whenever I want, as I need to update the iOS and Android versions of the app through a painful migration process.
I thought the solution would be introducing a subdomain (e.g. http://api.example.com/customers). How much would an additional DNS lookup affect the response times?
The thing to remember about DNS queries is that, as long as you have configured your DNS sensibly, clients will only make a lookup the first time communication is needed.
A DNS lookup will typically involve three queries: one to a root server, one to the .com (etc.) server, and a final one to the example.com nameserver. Each of these takes milliseconds and is performed once, then repeated roughly every hour or so, whenever the TTL expires.
The TL;DR is basically that it is a no-brainer: you get far, far more advantages from using a domain name than you will ever get from an IP address. The time cost is minimal and the packets are tiny.
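If you want to put a number on it for your own setup, a small sketch like the one below can help (the hostname is a placeholder; whether the repeat lookup is faster depends on your OS resolver cache and the record's TTL, not on Node):

import { lookup } from "dns/promises";

// Time an initial lookup against an immediate repeat. lookup() goes through
// the OS resolver (getaddrinfo), so any caching is the system's, not Node's.
async function timeLookups(hostname: string): Promise<void> {
  for (const label of ["first", "repeat"]) {
    const start = process.hrtime.bigint();
    const { address } = await lookup(hostname);
    const ms = Number(process.hrtime.bigint() - start) / 1e6;
    console.log(`${label}: ${hostname} -> ${address} (${ms.toFixed(1)} ms)`);
  }
}

timeLookups("example.com").catch(console.error); // replace with your API hostname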

Optimizing Jetty for heartbeat detection of thousands of machines?

I have a large number of machines (thousands and more) that perform an HTTP request to a Jetty server every X seconds to notify that they are alive. For what values of X should I use persistent HTTP connections (which limits the number of monitored machines to the number of concurrent connections), and for what values of X should the client re-establish a TCP connection each time (which in theory would allow monitoring more machines with the same Jetty server)?
How would the answer change for HTTPS connections? (Assuming CPU is not a constraint)
This question ignores scaling-out with multiple Jetty web servers on purpose.
Update: Basically the question can be reduced to the smallest recommended value of lowResourcesMaxIdleTime.
I would say that this is less of a Jetty scaling issue and more of a network scaling issue, in which case 'it depends' on your network infrastructure. Only you really know how your network is laid out and what sort of latencies are involved, so only you can come up with a value of X.
From an overhead perspective, persistent HTTP connections will of course have some minor effect (well, I say minor, but it depends on your network), and HTTPS will again have a larger impact... but only from a volume-of-traffic perspective, since you are assuming CPU is not a constraint.
So from a Jetty perspective, it really doesn't need to be involved in the question; you ultimately seem to be asking for help optimizing bytes of traffic on the wire, so really you are looking for the best protocol at this point. Since HTTP makes you send headers with each request, you may be well served by looking at something like SPDY or WebSocket, which give you persistent connections but are optimized for low round-trip network overhead. But... they seem sort of overkill for a heartbeat. :)
How about just making them send their requests at different times? When the first machine sends a heartbeat, you pick a time and return it to that machine as its next heartbeat time (also keeping the ID/time on the Jetty server); when the second machine sends a heartbeat, you pick another time to return to it, and so on.
In this way, each machine performs its heartbeat request at a different time, so there is no concurrency issue.
You can also use a random time for the first heartbeat if all the machines might start up at the same time.
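A toy sketch of that scheduling idea, in Node rather than Jetty purely to keep the example short (the interval and slot spacing are made-up numbers): the server answers each heartbeat with the next time that particular machine should check in.

import * as http from "http";

const INTERVAL_MS = 30_000; // X seconds between heartbeats (assumption)
const SLOT_MS = 50;         // spacing between assigned slots (assumption)

let lastAssignedSlot = Date.now();

const server = http.createServer((_req, res) => {
  // Give the caller the next free slot, at least one slot after the previous
  // assignment, so heartbeats arrive spread out instead of all at once.
  lastAssignedSlot = Math.max(lastAssignedSlot + SLOT_MS, Date.now());
  const nextHeartbeatAt = lastAssignedSlot + INTERVAL_MS;
  res.writeHead(200, { "Content-Type": "application/json" });
  res.end(JSON.stringify({ nextHeartbeatAt }));
});

server.listen(8080);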

Blocking IP addresses, preventing DoS attacks

So this is more of a general question about best practices for preventing DoS attacks. I'm just trying to get a grasp on how most people handle malicious requests from the same IP address, which is the problem we are currently having.
I figure it's better to block a truly malicious IP as high up the stack as possible, so as to avoid using more resources, especially when it comes to loading your application.
Thoughts?
You can prevent DoS attacks from occurring in various ways.
- Limit the number of queries per second from a particular IP address (a rough sketch follows below). Once the limit is reached, you can send a redirect to a cached error page to limit any further processing. You might also be able to get these IP addresses firewalled so that you don't have to process their requests at all. Limiting requests per IP address won't work very well, though, if the attacker forges the source IP address in the packets they are sending.
- I'd also try to build some smarts into your application to help deal with a DoS. Take Google Maps as an example: each individual site has to have its own API key, which I believe is limited to 50,000 requests per day. If your application worked in a similar way, you'd want to validate this key very early on in the request so that you don't use too many resources for the request. Once the 50,000 requests for that key are used, you can send appropriate proxy headers so that all future requests for that key (for the next hour, for example) are handled by the reverse proxy. It's not foolproof, though: if each request has a different URL, the reverse proxy will have to pass the request through to the backend server. You would also run into a problem if the DDoS used lots of different API keys.
- Depending on the target audience for your application, you might be able to blacklist large IP ranges that contribute significantly to the DDoS. For example, if your web service is for Australians only but you were getting a lot of DDoS requests from some networks in Korea, you could firewall the Korean networks. If you want your service to be accessible by anyone, then you're out of luck on this one.
- Another approach to dealing with a DDoS is to close up shop and wait it out. If you've got your own IP address or IP range, then you, your hosting company, or the data centre can null-route the traffic so that it goes into a black hole.
Referenced from here. There are other solutions on the same thread too.
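As a very rough illustration of the first point above (limiting queries per second per IP), here is a minimal in-process sketch as Express middleware. The limit and the in-memory store are assumptions; a real deployment would push this to the firewall, a reverse proxy, or a shared store instead.

import express from "express";

const LIMIT_PER_SECOND = 20; // assumption; tune for your traffic
const counters = new Map<string, { windowStart: number; count: number }>();

const app = express();

// Count requests per client IP in one-second windows and short-circuit
// anything over the limit before it reaches the application proper.
app.use((req, res, next) => {
  const ip = req.ip ?? "unknown";
  const now = Date.now();
  const entry = counters.get(ip);
  if (!entry || now - entry.windowStart >= 1000) {
    counters.set(ip, { windowStart: now, count: 1 });
    return next();
  }
  entry.count += 1;
  if (entry.count > LIMIT_PER_SECOND) {
    res.status(429).send("Too many requests");
    return;
  }
  next();
});

app.get("/", (_req, res) => {
  res.send("ok");
});

app.listen(3000);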
iptables -I INPUT -p tcp -s 1.2.3.4 -m statistic --probability 0.5 -j DROP
iptables -I INPUT n -p tcp -s 1.2.3.4 -m rpfilter --loose -j ACCEPT
# n would be a numeric index into the INPUT chain -- the default is to append to the INPUT chain
more at...
Can't Access Plesk Admin Because Of DOS Attack, Block IP Address Through SSH?