Currently Jetty has DoSFilter, which appears to provide protection against DoS attacks, i.e. it keeps track of the number of requests per connection. In a DDoS attack we expect the traffic could come from millions of IP addresses, and in that case DoSFilter won't do the job. Is there any other strategy you could apply here so that Jetty could survive?
Dealing with millions of IP addresses ...
This would need to be solved before the connection is accepted: some kind of OS or network hardware solution.
Jetty, being a server, has to accept the connection in order to do anything with it.
You could probably use the Jetty request log and a custom fail2ban setup to ban IP addresses at the OS level based on some kind of criteria in the access log (too many requests on a connection over X amount of time, triggering an IP-specific DoSFilter action, ban that IP at the OS level for Y amount of time).
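For illustration, here is a rough sketch of that kind of log-driven ban. It is not a real fail2ban jail, just the idea in Python: tail the request log, count requests per IP over a sliding window, and drop offenders at the OS level with iptables. The log path, thresholds and ban duration are assumptions you would tune for your own site.

    # Sketch only: tail an NCSA-style Jetty request log, count requests per client
    # IP in a sliding window, and ban offenders via iptables. Paths and limits are
    # illustrative assumptions, not recommended values.
    import collections
    import subprocess
    import time

    LOG_PATH = "/var/log/jetty/access.log"   # assumed request log location
    MAX_REQUESTS = 300                        # per WINDOW seconds, per IP
    WINDOW = 60
    BAN_SECONDS = 600

    hits = collections.defaultdict(collections.deque)  # ip -> timestamps of recent requests
    banned = {}                                         # ip -> unban time

    def ban(ip):
        subprocess.run(["iptables", "-I", "INPUT", "-s", ip, "-j", "DROP"], check=False)
        banned[ip] = time.time() + BAN_SECONDS

    def unban_expired():
        now = time.time()
        for ip, until in list(banned.items()):
            if now >= until:
                subprocess.run(["iptables", "-D", "INPUT", "-s", ip, "-j", "DROP"], check=False)
                del banned[ip]

    with open(LOG_PATH) as log:
        log.seek(0, 2)                      # start at end of file, like `tail -f`
        while True:
            line = log.readline()
            if not line:
                unban_expired()
                time.sleep(0.5)
                continue
            ip = line.split(" ", 1)[0]      # NCSA format: client IP is the first field
            now = time.time()
            q = hits[ip]
            q.append(now)
            while q and q[0] < now - WINDOW:
                q.popleft()
            if len(q) > MAX_REQUESTS and ip not in banned:
                ban(ip)

A real fail2ban setup gets you the same effect declaratively: a filter regex over the access log, a jail with maxretry/findtime, and automatic unbanning after bantime.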
I am running a server (that uses tornado python) on a single AWS instance and I am running into spikes in websocket latency.
Profiling the round trip time from when a websocket message is sent to the client, which then immediately sends an ack message back to the server, to when the server receives the ack message yields an average of <0.1 seconds; however, I note it sometimes goes up to 3 seconds. Note: there are no spikes when running the server locally.
What could be the cause or fix for this? I looked at the CPU usage and it only goes up to 40% max. The spikes are not correlated with heavy traffic (2 or 3 clients usually) and the client's internet seems fine. I find it hard to believe the instance is going beyond capacity with such low usage.
The fact that the spike is 3 seconds is actually telling you a lot more than you may suspect, about the nature of the problem.
It's packet loss.
TCP, as you likely know, is said to provide "reliable" transport, guaranteeing that payload sent is received by the far end in the order in which it was sent, because TCP reassembles things in the correct order before delivering the payload. One significant way in which this is accomplished is by the automatic retransmission of packets that are considered to have been lost.
You'll never guess the default initial timer value for retransmissions of lost packets. Or, perhaps, now, you will.
It's 3 seconds in many, if not most, implementations, based on standards established several years ago in a time when the bandwidth and latency of today's transmission links were unheard of, perhaps unimagined.
You won't see evidence of the retransmission at the websocket server or in the client software, because TCP shields the higher layers from knowing that it occurs... but 3 seconds is a dead giveaway that this is exactly the problem.
You'll see the retransmissions of the traffic occurring if you observe the network traffic with a packet sniffer, though that will only serve to confirm that this is the issue.
It could be loss from server to client, or loss from client to server. The latter is generally more likely, since clients often have a lower amount of available upstream bandwidth... but the directionality of the packet loss doesn't clearly indicate the physical location where it is occurring. Unless your client keeps track of local time, so that request and response initiation times can be correlated, you don't know whether the delay is in the message, or in the acknowledgement.
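If you can change the payload, one cheap way to get that correlation is to have the client echo timestamps in the ack. A hedged sketch follows (field names are invented for illustration, and it only splits the delay approximately, because the two clocks are not perfectly synchronised):

    import json, time

    # server side, when sending a probe to the client:
    probe = json.dumps({"type": "ping", "server_sent": time.time()})

    # client side, on receipt, immediately ack and include both timestamps:
    def make_ack(probe_msg):
        p = json.loads(probe_msg)
        return json.dumps({"type": "ack",
                           "server_sent": p["server_sent"],
                           "client_received": time.time()})

    # server side, on receiving the ack: approximate outbound vs. return delay
    def split_delays(ack_msg):
        a = json.loads(ack_msg)
        outbound = a["client_received"] - a["server_sent"]
        inbound = time.time() - a["client_received"]
        return outbound, inbound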
Under relatively light load, it seems unlikely that the problem is on your instance or in the AWS network on your side, and you obviously can't connect a sniffer to arbitrary points on the Internet to pinpoint the problem.
Given a case like this, it may be easier -- and surprisingly feasible -- to prove where the problem isn't, rather than where it is.
One technique for this would be to create a deliberate detour for the traffic through different equipment located elsewhere -- such as a different AWS region or another cloud provider.
First, of course, you'll want to learn to spot these retransmissions using wireshark.
Then, configure a proxy server at a different location, using a simple TCP connection proxy -- such as HAProxy, or even a simple tool like redir or socat.
Such a configuration will listen for connections from clients and, when one is established, create a new TCP connection to the destination (your websocket server), but -- importantly -- it only ties the two connections together at the payload level -- not the TCP level, and of course nothing lower -- so retransmissions will only be seen on the wire between this intermediate server and the end of the connection with the packet loss problem. The other end will show no evidence of the retransmissions -- just data arriving later than expected.
For this test to be meaningful, the proxy needs to be located away from the server and the client, and with no meaningful common infrastructure -- hence the suggestion of placing it in a different AWS region. A different availability zone in the same region may share common Internet infrastructure at some level, so that's not far enough away for this purpose.
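If you don't want to configure HAProxy just for a quick test, a payload-level relay is only a few lines of code. Here is a toy sketch (Python, placeholder addresses) of what was described above: two independent TCP connections joined only at the payload level, so retransmissions on one leg never appear on the other.

    # Toy payload-level TCP relay, illustrating what tools like socat or HAProxy
    # do for this test. Listen/forward addresses are placeholders.
    import socket
    import threading

    LISTEN_ADDR = ("0.0.0.0", 9000)              # clients connect here
    SERVER_ADDR = ("your-websocket-server", 80)  # placeholder for the real backend

    def pump(src, dst):
        # Copy bytes one way until either side closes, then tear both sockets down.
        try:
            while True:
                data = src.recv(65536)
                if not data:
                    break
                dst.sendall(data)
        except OSError:
            pass
        finally:
            for s in (src, dst):
                try:
                    s.close()
                except OSError:
                    pass

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(LISTEN_ADDR)
    listener.listen(64)

    while True:
        client, _ = listener.accept()
        upstream = socket.create_connection(SERVER_ADDR)
        threading.Thread(target=pump, args=(client, upstream), daemon=True).start()
        threading.Thread(target=pump, args=(upstream, client), daemon=True).start()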
If client <--> proxy <--> server shows TCP retransmissions on the path between proxy and server, and not between client and proxy, the problem really is likely to be in your server, its hardware, network, or Internet connection, and you'll have to proceed accordingly.
Conversely (and, I would suggest, more likely) if the path between proxy and server is free of retransmissions but the path between client and proxy is still dirty, you have eliminated the server and its infrastructure as the source of the problem. How to proceed is up to you, but at this point you do know what the problem... isn't.
Two other possibilities:
Both sides remain dirty, which is the least likely scenario. Rule 1 of troubleshooting is to assume initially that you only have one problem, not two.
Or, both sides are suddenly and unexpectedly clean when traffic uses this setup, which suggests that your test setup has routed around a broken piece of the Internet. You've "solved" it but have no idea how. We'll also hope this isn't the outcome, but given the vagaries of the global Internet, it's not unthinkable that your stack may include components like this, with geolocation-DNS-based selection of an intermediate endpoint. This seems like a convolution but does have its place.
Such a tactic is actually part of the logic behind the S3 transfer acceleration feature. The content is not any closer to the end user, but the TCP connection from the browser is being terminated on equipment in the AWS edge network, at a location that is often nearer to the browser, and a second TCP connection back to the bucket is established, with the payload connected together... and, yes, it's faster and more stable, with the significance of the change becoming more notable as distance and connection quality vary.
Recently dealt with a botnet running a sub-domain brute force/crawling script. Would run through the alphabet & numbers sequentially, which resulted in a minor nuisance and small load increase for legitimate traffic.
For example, hitting a.domain, b.domain, ..., 9.domain, aa.domain, ..., a9.domain, etc.
Obviously, this is quite stupid, and fortunately it only originated from a few IPs at a time, and the website in question was behind multiple auto-scaling load balancers. Attacks were stopped by grabbing the X-Forwarded-For address from Varnish; detection was scripted via the subdomain attempts, and the IP was added to a remote blocklist, which would be regularly refreshed and pulled into a Block.vcl on all Varnish servers, voila.
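A simplified sketch of the kind of detection involved (the log format, paths and thresholds here are illustrative, not the actual script): flag any client IP that probes many distinct short subdomains, and append it to the blocklist that later gets compiled into Block.vcl.

    # Sketch: scan a log of "client_ip host" pairs, flag IPs probing many distinct
    # short subdomains, append them to the shared blocklist. All names are assumptions.
    import collections
    import re

    BASE_DOMAIN = "example.com"          # placeholder for the real domain
    DISTINCT_SUBDOMAIN_LIMIT = 25        # more than this from one IP looks like a crawl

    seen = collections.defaultdict(set)  # ip -> set of probed subdomains

    with open("/var/log/varnish/host_by_ip.log") as log:   # assumed "ip host" per line
        for line in log:
            try:
                ip, host = line.split()
            except ValueError:
                continue
            m = re.match(r"^([a-z0-9]{1,3})\." + re.escape(BASE_DOMAIN) + r"$", host)
            if m:
                seen[ip].add(m.group(1))

    with open("blocklist.txt", "a") as out:
        for ip, subs in seen.items():
            if len(subs) > DISTINCT_SUBDOMAIN_LIMIT:
                out.write(ip + "\n")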
This worked well, detecting and taking care of things within a couple of minutes each time. However, it was noted that after a brute-forcing IP was detected and the block applied, 99.9% of its traffic would stop, but the occasional request from the blocked IP would still manage to get through. Not enough to cause a fuss, but it raised the question: why? I don't understand why a request at the Varnish level would still make it through when hitting the reject-on-IP rule of my Block.vcl.
Is there some inherent limitation that might have come into play here which would allow a small number of requests through? Maybe based on the available resources or sheer number of requests per second hitting Varnish overwhelming it ever so slightly?
Resource wise the web servers seemed fine so I'm unsure. Any ideas?
In a Linux/C++ TCP server I need to prevent a malicious client from opening multiple sockets, otherwise they could just open thousands of connections until the server crashes.
What is the standard way of checking whether the same computer already has a connection to the server? If I do it based on IP address, wouldn't that mean two people in the same house couldn't connect to the server at the same time, even if they are on different computers?
Any info helps!
Thanks in advance.
TCP in itself doesn't really provide anything other than the IP address for identifying clients. A couple of (non-exclusive) options:
1) Limit the number of connections from any IP address to a reasonable number, like 10 or 20 (depending on what your system actually does.) This way, it will prevent malicious DoS attacks, but still allow for reasonable usability.
2) Limit the maximum number of connections to something reasonable.
3) You could delegate this to a higher-layer solution. As a part of your protocol, have the client send a unique identifier that is generated only once (per installation, etc). This could be easily spoofed, however.
I believe 1 and 2 are how many servers handle it. Put the limits in config files, so they can be tuned depending on the scenario.
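To make 1) and 2) concrete, here is a sketch of the bookkeeping in an accept loop. It's shown in Python for brevity rather than C++, but the same counters and checks translate directly to an accept()/close() loop; the limits are placeholders that belong in those config files.

    # Sketch of a per-IP cap (option 1) and a global cap (option 2) in an accept loop.
    import collections
    import socket
    import threading

    MAX_PER_IP = 20          # option 1: per-IP connection cap
    MAX_TOTAL = 10000        # option 2: global connection cap

    per_ip = collections.Counter()
    lock = threading.Lock()

    def handle(conn, ip):
        try:
            while conn.recv(4096):      # placeholder for real protocol handling
                pass
        finally:
            conn.close()
            with lock:
                per_ip[ip] -= 1

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", 5000))
    srv.listen(128)

    while True:
        conn, (ip, _port) = srv.accept()
        with lock:
            if per_ip[ip] >= MAX_PER_IP or sum(per_ip.values()) >= MAX_TOTAL:
                conn.close()            # over a limit: drop immediately, do no work
                continue
            per_ip[ip] += 1
        threading.Thread(target=handle, args=(conn, ip), daemon=True).start()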
There is only the IP address to base "is this the same sender?" on, unless you have some sort of subscription/login system (but then someone can try to log in a gazillion times at once, since there must be some sort of handshake for logging in).
If two clients are using the same router (that uses NAT or some similar scheme), your server will see the same IP address, so allowing only one connection per IP address wouldn't work very well for "multiple users from the same home". This also applies if they are for example using a university network or a company network.
So depending on what you are supplying and how many clients you can expect from the same place, you may need to go a fair bit higher than 10. Of course, if you log when this happens, and you see a fair number of "looks like valid real users failing to get in", you can adjust the number.
It may also make sense to have some sort of "rolling average", so you accept X new connections per Y seconds from each IP address, rather than having a fixed maximum number. This is meaningful if connections last quite some time... For short duration connections, it's pretty pointless...
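That rolling-window check is small enough to sketch directly; X and Y here are placeholders you would tune against what legitimate users actually do.

    # Accept at most X new connections per Y seconds per IP (sliding window).
    import collections, time

    X, Y = 30, 60
    recent = collections.defaultdict(collections.deque)   # ip -> timestamps of recent accepts

    def allow_new_connection(ip, now=None):
        now = time.time() if now is None else now
        q = recent[ip]
        while q and q[0] <= now - Y:
            q.popleft()
        if len(q) >= X:
            return False
        q.append(now)
        return True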
I have a large number of machines (thousands and more) that every X seconds perform an HTTP request to a Jetty server to notify that they are alive. For what value of X should I use persistent HTTP connections (which limits the number of monitored machines to the number of concurrent connections), and for what value of X should the client re-establish a TCP connection (which in theory would allow monitoring more machines with the same Jetty server)?
How would the answer change for HTTPS connections? (Assuming CPU is not a constraint)
This question ignores scaling-out with multiple Jetty web servers on purpose.
Update: Basically the question can be reduced to the smallest recommended value of lowResourcesMaxIdleTime.
I would say that this is less of a jetty scaling issue and more of a network scaling issue, in which case 'it depends' on your network infrastructure. Only you really know how your network is laid out and what sort of latencies are involved in order to come up with a value of X.
From an overhead perspective, the persistent HTTP connections will of course have some minor effect (well, I say minor, but it depends on your network) and HTTPS will again have a larger impact... but only from a volume-of-traffic perspective, since you are assuming CPU is not a constraint.
So from a Jetty perspective, it really doesn't need to be involved in the question: you ultimately seem to be asking for help optimizing bytes of traffic on the wire, so really you are looking for the best protocol at this point. Since with HTTP you have to send headers with each request, you may be well served looking at something like SPDY or WebSocket, which give you persistent connections optimized for low round-trip network overhead. But... they seem sort of overkill for a heartbeat. :)
How about just making them send requests at different times? When the first machine checks in, pick a time to return in the response as that machine's next heartbeat time (and keep the id/time at the Jetty server); when the second machine checks in, pick a different time for it.
This way, each machine performs its heartbeat request at a different time, so there is no concurrency issue.
You can also use a random time for the first heartbeat, in case all the machines start up at the same time.
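A minimal sketch of that scheduling on the server side (interval and slot count are assumptions; a real server would persist the assignments):

    # Reply to each heartbeat with that machine's next check-in time, spreading
    # machines evenly over the interval so they never report at the same instant.
    import time

    HEARTBEAT_INTERVAL = 60.0        # seconds between heartbeats for each machine
    SLOTS = 1000                     # how finely to spread machines over the interval

    next_slot = 0
    assigned = {}                    # machine_id -> offset within the interval

    def next_heartbeat_time(machine_id, now=None):
        global next_slot
        now = time.time() if now is None else now
        if machine_id not in assigned:
            assigned[machine_id] = (next_slot % SLOTS) * (HEARTBEAT_INTERVAL / SLOTS)
            next_slot += 1
        offset = assigned[machine_id]
        # next interval boundary, plus this machine's personal offset
        return (now // HEARTBEAT_INTERVAL + 1) * HEARTBEAT_INTERVAL + offset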
So this is more of a general question on best practice for preventing DoS attacks; I'm just trying to get a grasp on how most people handle malicious requests from the same IP address, which is the problem we are currently having.
I figure it's better to block a truly malicious IP as high up the stack as possible, so as to avoid using more resources, especially when it comes to loading your application.
Thoughts?
You can prevent DoS attacks from occurring in various ways.
Limiting the number of queries/second from a particular IP address. Once the limit is reached, you can send a redirect to a cached error page to limit any further processing. You might also be able to get these IP addresses firewalled so that you don't have to process their requests at all. Limiting requests per IP address won't work very well, though, if the attacker forges the source IP address in the packets they are sending.
I'd also be trying to build some smarts into your application to help deal with a DoS. Take Google Maps as an example. Each individual site has to have its own API key, which I believe is limited to 50,000 requests per day. If your application worked in a similar way, then you'd want to validate this key very early on in the request so that you don't use too many resources for the request. Once the 50,000 requests for that key are used, you can send appropriate proxy headers such that all future requests (for the next hour, for example) for that key are handled by the reverse proxy. It's not foolproof though. If each request has a different URL, then the reverse proxy will have to pass the request through to the backend server. You would also run into a problem if the DDoS used lots of different API keys.
Depending on the target audience for your application, you might be able to blacklist large IP ranges that contribute significantly to the DDoS. For example, if your web service is for Australians only, but you were getting a lot of DDoS requests from some networks in Korea, then you could firewall the Korean networks. If you want your service to be accessible by anyone, then you're out of luck on this one.
Another approach to dealing with a DDoS is to close up shop and wait it out. If you've got your own IP address or IP range, then you, your hosting company or the data centre can null route the traffic so that it goes into a black hole.
Referenced from here. There are other solutions on the same thread too.
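As an illustration of the "validate the key very early" idea from the quoted answer, a per-key daily quota check might look roughly like this (the in-memory store, limit and headers are assumptions, not any particular product's API):

    # Check a per-key daily quota before doing real work; once exhausted, return a
    # cacheable error so a reverse proxy can absorb further requests for a while.
    import time

    DAILY_LIMIT = 50000
    usage = {}                     # api_key -> [day_number, count]

    def check_quota(api_key, now=None):
        now = time.time() if now is None else now
        day = int(now // 86400)
        record = usage.setdefault(api_key, [day, 0])
        if record[0] != day:
            record[0], record[1] = day, 0
        record[1] += 1
        if record[1] > DAILY_LIMIT:
            # over quota: short-circuit with a cacheable error so the proxy serves it
            return 429, {"Cache-Control": "public, max-age=3600"}
        return 200, {}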
iptables -I INPUT -p tcp -s 1.2.3.4 -m statistic --probability 0.5 -j DROP
iptables -I INPUT n -p tcp -s 1.2.3.4 -m rpfilter --loose -j ACCEPT
# n would be a numeric index into the INPUT chain -- the default is to append to the INPUT chain
more at...
Can't Access Plesk Admin Because Of DOS Attack, Block IP Address Through SSH?