GCP Backend service connection draining options for deployments

Has anybody ever tried to achieve Google Cloud HTTP(S) load balancer backend connection draining by either:
1. Setting the capacity of the respective instance groups inside the backend service to 0% (0 RPS)
2. Removing the instance group(s) from the backend service
3. Changing the backend service in the URL map to point to another backend service.
I would like to achieve an A/B testing deployment with a GCLB in front of two GKE clusters. The docs only say that connection draining is triggered for a specific instance when that instance is removed from the instance group (automatically or manually):
https://cloud.google.com/load-balancing/docs/enabling-connection-draining

Those are very particular scenarios; the expected behaviour for each of your three options is the following:
1. Setting a max rate per instance or max rate (per instance group) to zero (when the balancing mode is rate) won't drain existing connections. The balancing mode simply helps the load balancer rank backends (instance groups in this situation) from most to least attractive to handle new connections. When the balancing mode is rate and the max RPS is zero, that just means the backend is "not attractive" even when it is servicing zero requests. But if all backends have their RPS set to zero, or if they don't but are near capacity, a backend with an RPS of zero may end up just as unattractive as all the other backends.
2. Removing the instance group as a backend from the backend service will most likely not respect any connection draining, because that removes the load balancer from the equation.
3. This scenario is pretty similar to the previous one, without the downside of removing the load balancer. However, I think that pointing the URL map to a different backend won't trigger connection draining, since the instances will still be reachable even though you are referring to a different backend. Downtime is expected, but draining shouldn't be activated.

Related

Marking a compute instance as busy to prevent disrupting connections

I have a Golang service using TCP running on GCP's compute VMs with autoscaling. When the CPU usage spikes, new instances are created and deployed (as expected), but when the CPU usage settles again the instances are destroyed. That in itself is fine and entirely reasonable, but destroying instances does not take the established TCP connections into account and thus disconnects users.
I'd like to keep the VM instances running until the last connection has been closed to prevent disconnecting users. Is there a way to mark the instance as "busy" telling the autoscaler not to remove that instance until it isn't busy? I have implemented health checks but these do not signal the busyness of the instance, only whether the instance is alive or not.
You need to enable connection draining on the backend service that your autoscaled instance group belongs to:
If the group is part of a backend service that has enabled connection draining, it can take up to 60 seconds after the connection draining duration has elapsed before the VM instance is removed or deleted.
Here are the steps on how to achieve this:
Go to the Load balancing page in the Google Cloud Console.
Click the Edit button for your load balancer or create a new load balancer.
Click Backend configuration.
Click Advanced configurations at the bottom of your backend service.
In the Connection draining timeout field, enter a value from 0 - 3600. A setting of 0 disables connection draining.
Currently you can set a connection draining timeout of up to 3600 s (= 1 hour), which should suffice for your requirements.
see: https://cloud.google.com/compute/docs/autoscaler/understanding-autoscaler-decisions
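Connection draining handles the load-balancer side; the Go service on the instance can complement it by refusing new connections and waiting for in-flight ones once it receives a termination signal. A minimal sketch under those assumptions follows (the port, the grace period and the echo handler are placeholders, not anything taken from the question); the grace period would normally be chosen to match the draining timeout configured on the backend service:

    // Minimal sketch: stop accepting on SIGTERM, then wait for in-flight
    // connections to finish before exiting, bounded by a grace period that
    // should match the connection draining timeout on the backend service.
    package main

    import (
        "log"
        "net"
        "os"
        "os/signal"
        "sync"
        "syscall"
        "time"
    )

    func main() {
        ln, err := net.Listen("tcp", ":9000") // placeholder port
        if err != nil {
            log.Fatal(err)
        }

        stop := make(chan os.Signal, 1)
        signal.Notify(stop, syscall.SIGTERM, syscall.SIGINT)
        go func() {
            <-stop
            ln.Close() // stop accepting; Accept below returns an error
        }()

        var wg sync.WaitGroup
        for {
            conn, err := ln.Accept()
            if err != nil {
                break // listener closed, begin draining
            }
            wg.Add(1)
            go func(c net.Conn) {
                defer wg.Done()
                defer c.Close()
                handleConn(c)
            }(conn)
        }

        // Wait for existing connections, but no longer than the grace period.
        done := make(chan struct{})
        go func() { wg.Wait(); close(done) }()
        select {
        case <-done:
            log.Println("all connections drained, exiting")
        case <-time.After(60 * time.Minute): // placeholder grace period
            log.Println("grace period elapsed, exiting with connections still open")
        }
    }

    // handleConn is a placeholder handler: it echoes until the client hangs up.
    func handleConn(c net.Conn) {
        buf := make([]byte, 4096)
        for {
            n, err := c.Read(buf)
            if err != nil {
                return
            }
            if _, err := c.Write(buf[:n]); err != nil {
                return
            }
        }
    }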

Limit number of connections to instances with AWS ELB

We are using AWS classic ELB for our service and our service can only serve x number of requests at a time. If the number of requests are greater than x then we do not want to route those requests to the instance and neither do we want to lose those requests. We would like to limit the number of connections to the instances registered with the ELB. Is there some ELB setting to configure max connections to instances?
Another solution I could find was to use ELB connection draining, but based on the ELB doc [1], using connection draining will mark the instance as OutOfService after serving in-flight requests. Does that mean the instance will be terminated and de-registered from the ELB after in-flight requests are served? We do not want to terminate and de-register the instances, we just want to limit the number of connections to the instances. Any solutions?
[1] http://docs.aws.amazon.com/elasticloadbalancing/latest/classic/config-conn-drain.html
ELB is meant to spread traffic evenly across the instances registered with it. If you have more traffic, you throw up more instances to deal with it. This is generally why a load balancer is matched with an Auto Scaling group. The Auto Scaling group will look at the constraints you set and, based on those, either spin up more instances or pull them down (i.e. when your traffic starts to slow down).
Connection draining is more meant for pulling traffic from bad instances so it doesn't get lost. Bad instances mean they aren't passing health checks because something on the instance is broken. ELB by itself doesn't terminate instances, that's another part of what the Auto Scaling Group is meant to do (basically terminate the bad instance and spin up a new instance to replace it). All ELB does is stop sending traffic to it.
It appears your situation is:
Users are sending API requests to your Load Balancer
You have several instances associated with your Load Balancer to process those requests
You do not appear to be using Auto Scaling
You do not always have sufficient capacity to respond to incoming requests, but you do not want to lose any of the requests
In situations where requests come at a higher rate than you can process them, you basically have three choices:
You could put the messages into a queue and consume them when capacity is available. You could either put everything in a queue (simple), or only use a queue when things are too busy (more complex).
You could scale to handle the load, either by using Auto Scaling to add additional Amazon EC2 instances or by using AWS Lambda to process the requests (Lambda automatically scales).
You could drop requests that you are unable to process. Unless you have implemented a queue, this is going to happen at some point if requests rise above your capacity to process them.
The best solution is to use AWS Lambda functions rather than requiring Amazon EC2 instances. Lambda can tie directly to AWS API Gateway, which can front-end the API requests and provide security, throttling and caching.
The simplest method is to use Auto Scaling to increase the number of instances to try to handle the volume of requests you have arriving. This is best when there are predictable usage patterns, such as high loads during the day and less load at night. It is less useful when spikes occur in short, unpredictable periods.
To fully guarantee no loss of requests, you would need to use a queue. Rather than requests going directly to your application, you would need an initial layer that receives the request and pushes it into a queue. A backend process would then process the message and return a result that is somehow passed back as a response. (It's more difficult providing responses to messages passed via a queue because there is a disconnect between the request and the response.)
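If you do end up shedding excess load at the instance itself (the third option above), a minimal Go sketch of capping in-flight requests at the application layer might look like the following; the limit of 100, the /work path and the port are illustrative only:

    // Minimal sketch: cap in-flight requests per instance and shed the rest
    // with 503, since classic ELB has no per-instance connection limit.
    package main

    import (
        "log"
        "net/http"
        "time"
    )

    func limitConcurrency(max int, next http.Handler) http.Handler {
        sem := make(chan struct{}, max) // counting semaphore
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            select {
            case sem <- struct{}{}:
                defer func() { <-sem }()
                next.ServeHTTP(w, r)
            default:
                // At capacity: reject immediately instead of queueing in-process.
                w.Header().Set("Retry-After", "1")
                http.Error(w, "server busy", http.StatusServiceUnavailable)
            }
        })
    }

    func main() {
        work := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            time.Sleep(200 * time.Millisecond) // stand-in for the real work
            w.Write([]byte("done\n"))
        })
        http.Handle("/work", limitConcurrency(100, work)) // illustrative limit
        log.Fatal(http.ListenAndServe(":8080", nil))
    }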
AWS ELB has practically no limit on incoming requests. If your application can only handle N connections, run multiple servers behind the ELB and make the ELB health check URL your application URL. Once your application is unable to respond to requests, the ELB will automatically forward requests to another server behind it, so you are not going to miss any requests.

How rate limiting will work when same instance group is behind two different load balancers

I was reading about rate limiting and auto-scaling in GCP and got stuck at this question:
Scenario:
I created an instance group ig with auto-scaling OFF.
I created a load balancer lb1, details are:
lb1 contains a backend service bs1 which points to instance group ig, with Maximum RPS set to 1000 for the whole group
frontend port: 8080
path rule: /alpha/*
lb1 is an external load balancer
I created one more load balancer lb2, details are:
lb2 contains a backend service bs2 which points to instance group ig, with Maximum RPS set to 2000 for the whole group
frontend port: 9090
path rule: /beta/*
lb2 is a regional load balancer
Questions that I have:
Who will monitor the requests served by both load balancers?
Which limit will be honoured, 1000 or 2000?
Will the overall requests (i.e. via lb1 and lb2) be rate limited, or will individual limits be applied to each request flow?
TL;DR - The RPS is set in the backend service, so each load balancer will have its own RPS limit, independent of the other.
Who will monitor the requests served by the both the load balancers?
Google Compute Engine (GCE) will monitor the requests being served by the load balancers and direct traffic accordingly to stay within the RPS limit of each backend within the backend service.
Which limit will be honoured 1000 or 2000?
1000 with respect to the first load balancer and 2000 with respect to the second load balancer. Remember that you're using 2 separate backend services, bs1 and bs2, for lb1 and lb2 respectively.
Will the overall requests (i.e via lb1 and lb2) will be rate limited or individual limits will be applied for both the request flows?
Requests going through lb1 for bs1 will be kept within the maximum of 1000 RPS you set for the group; requests going through lb2 for bs2 will be kept within the maximum of 2000 RPS for the group. Since both backend services point at the same instance group ig, your service running on the instances in ig should be capable of handling at least 3000 RPS combined.
Longer version
Instance groups do not have a way to specify RPS, only backend services do. Instance groups only help to group a list of instances. So although you could use the same instance groups in multiple backend services, you need to account for the RPS value you set in the corresponding backend service if your goal is to share instances among multiple backend services. GCE will not be able to figure this out automatically.
A backend service represents a micro-service ideally, which is served by a group of backend VMs (from the instance group). You should calculate beforehand how much maximum RPS a single backend instance (i.e. your service running inside the VM) can handle to set this limit. If you intend to share VMs across backend services, you will need to ensure that the combined RPS limit in the worst case is something that your service inside the VM is able to handle.
Google Compute Engine (GCE) will monitor the metrics per backend service (i.e. number of requests per second in your case) and will use that for load balancing. Each load balancer is logically different, and hence there will be no aggregation across load balancers (even if using the same instance group).
Load distribution algorithm
HTTP(S) load balancing provides two methods of determining instance
load. Within the backend service object, the balancingMode property
selects between the requests per second (RPS) and CPU utilization
modes. Both modes allow a maximum value to be specified; the HTTP load
balancer will try to ensure that load remains under the limit, but
short bursts above the limit can occur during failover or load spike
events.
Incoming requests are sent to the region closest to the user, provided
that region has available capacity. If more than one zone is
configured with backends in a region, the traffic is distributed
across the instance groups in each zone according to each group's
capacity. Within the zone, the requests are spread evenly over the
instances using a round-robin algorithm. Round-robin distribution can
be overridden by configuring session affinity.
maxRate and maxRatePerInstance
In the backend service, there are 2 configuration fields related to RPS: one is maxRate and the other is maxRatePerInstance. maxRate sets the RPS for the whole group, whereas maxRatePerInstance sets the RPS per instance. It looks like both can be used in conjunction if needed.
backends[].maxRate
integer
The max requests per second (RPS) of the
group. Can be used with either RATE or UTILIZATION balancing modes,
but required if RATE mode. For RATE mode, either maxRate or
maxRatePerInstance must be set.
This cannot be used for internal load balancing.
backends[].maxRatePerInstance
float
The max requests per second (RPS)
that a single backend instance can handle. This is used to calculate
the capacity of the group. Can be used in either balancing mode. For
RATE mode, either maxRate or maxRatePerInstance must be set.
This cannot be used for internal load balancing.
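As a hedged illustration of how the two fields differ, here is a sketch that builds the two alternative backend entries with the Go client library (google.golang.org/api/compute/v1) and prints them, without calling the API; the project, zone, instance-group URL and numbers are placeholders:

    // Sketch: the two alternative ways to express an RPS cap on a backend.
    // The project, zone, group URL and the numbers are placeholders.
    package main

    import (
        "encoding/json"
        "fmt"

        compute "google.golang.org/api/compute/v1"
    )

    func main() {
        groupURL := "https://www.googleapis.com/compute/v1/projects/my-project/zones/us-central1-a/instanceGroups/ig"

        // Cap for the whole group, as in bs1/bs2 from the question.
        perGroup := &compute.Backend{
            Group:         groupURL,
            BalancingMode: "RATE",
            MaxRate:       1000,
        }

        // Cap per instance; the group's capacity is then instances * 100.
        perInstance := &compute.Backend{
            Group:              groupURL,
            BalancingMode:      "RATE",
            MaxRatePerInstance: 100,
        }

        for _, b := range []*compute.Backend{perGroup, perInstance} {
            out, _ := json.MarshalIndent(b, "", "  ")
            fmt.Println(string(out))
        }
    }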
Receiving requests at a higher rate than specified RPS
If you happen to receive requests at a rate higher than the RPS and you have autoscaling disabled, I could not find any documentation on the Google Cloud website regarding the exact expected behavior. The closest I could find is this one, where it specifies that the load balancer will try to keep each instance at or below the specified RPS. So it could mean that the requests could get dropped if it exceeds the RPS, and clients might see one of the 5XX error codes (possibly 502) based on this:
failed_to_pick_backend
The load balancer failed to pick a healthy backend to handle the
request.
502
You could probably figure it out the hard way by setting a fairly low RPS like 10 or 20 and see what happens. Look at the timestamps at which you receive the requests on your backend to determine the behavior. Also, the limiting might not happen on exactly the 11th or 21st request, so try sending far more than that per second to verify if the requests are being dropped.
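A minimal sketch of such an experiment, assuming a reachable path behind the load balancer (the URL, burst size and duration below are placeholders): it fires a fixed burst of requests every second and tallies the status codes the clients see, which you can then compare against the request timestamps logged on your backends:

    // Sketch: send a fixed burst of requests each second at the load balancer
    // and tally status codes, to observe how an exceeded RPS limit shows up.
    package main

    import (
        "fmt"
        "net/http"
        "sync"
        "time"
    )

    func main() {
        const (
            target   = "http://LB_IP/alpha/ping" // placeholder URL behind the LB
            perSec   = 50                        // well above the low RPS you configured
            duration = 10 * time.Second
        )

        client := &http.Client{Timeout: 5 * time.Second}
        ticker := time.NewTicker(time.Second)
        defer ticker.Stop()
        deadline := time.Now().Add(duration)

        for now := range ticker.C {
            if now.After(deadline) {
                return
            }
            var (
                wg     sync.WaitGroup
                mu     sync.Mutex
                counts = map[int]int{}
            )
            for i := 0; i < perSec; i++ {
                wg.Add(1)
                go func() {
                    defer wg.Done()
                    resp, err := client.Get(target)
                    if err != nil {
                        return
                    }
                    resp.Body.Close()
                    mu.Lock()
                    counts[resp.StatusCode]++
                    mu.Unlock()
                }()
            }
            wg.Wait()
            fmt.Printf("%s status codes: %v\n", now.Format(time.RFC3339), counts)
        }
    }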
With Autoscaling
If you enable autoscaling though, this will automatically trigger the autoscaler and make it expand the number of instances in the instance group based on the target utilization level you set in the Autoscaler.
NOTE: Updated answer since you actually specified that you're using 2 separate backend services.

AWS Autoscalling Rolling over Connections to new Instances

Is it possible to automatically 'roll over' a connection between auto scaled instances?
Given instances which provide a compute intensive service, we would like to
Autoscale a new instance after CPU reaches, say, 90%
Have requests for service handled by the new instance.
It does not appear that there is a way with the AWS Dashboard to set this up, or have I missed something?
What you're looking for is a load balancer. If you're using HTTP, this works pretty much out of the box. Clients open connections to the load balancer, which then distributes individual HTTP requests from the connection evenly across instances in your auto scaling group. When a new instance joins the group, the load balancer automatically shifts a portion of the incoming requests over to the new instance.
Things get a bit trickier if you're speaking a protocol other than HTTP(S). A generic TCP load balancer can't tell where one "request" ends and the next begins (or if that even makes sense for your protocol), so incoming TCP connections get mapped directly to a particular backend host. The load balancer will route new connections to the new instance when it spins up, but it can't migrate existing connections over.
Typically what you'll want to do in this scenario is to have clients periodically close their connections to the service and create new ones - especially if they're seeing increased latencies or other evidence that the instance they're talking to is overworked.
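A minimal sketch of that client-side pattern for a raw TCP protocol (the address, recycle interval and heartbeat are illustrative): the client reconnects on a timer, giving the load balancer a chance to route the fresh connection to a newly added, less loaded instance.

    // Sketch: periodically recycle a long-lived TCP connection so that new
    // connections can be routed to freshly added instances.
    package main

    import (
        "log"
        "net"
        "time"
    )

    const (
        addr     = "lb.example.com:9000" // placeholder load balancer address
        lifetime = 5 * time.Minute       // illustrative recycle interval
    )

    func main() {
        for {
            conn, err := net.DialTimeout("tcp", addr, 10*time.Second)
            if err != nil {
                log.Printf("dial failed: %v, retrying", err)
                time.Sleep(5 * time.Second)
                continue
            }
            useConnectionFor(conn, lifetime)
            conn.Close() // reconnect, letting the LB pick an instance again
        }
    }

    // useConnectionFor is a stand-in for the real protocol; here it just sends
    // a heartbeat once per second until the lifetime elapses or a write fails.
    func useConnectionFor(conn net.Conn, lifetime time.Duration) {
        deadline := time.Now().Add(lifetime)
        for time.Now().Before(deadline) {
            if _, err := conn.Write([]byte("ping\n")); err != nil {
                return
            }
            time.Sleep(time.Second)
        }
    }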

Prevent machine on Amazon from shutting down before all users finished tasks

I'm planning a server environment on AWS with auto scaling over VPC.
My application has some process that is done in several steps on server, and the user should stick to the same server by using ELB's sticky session.
The problem is that when the auto scaling group is supposed to shut down a server, some users may be in the middle of the process (the process takes multiple requests - for example:
1. create an album
2. upload photos to the album each at a time
3. convert photos to movie and delete photos
4. store movie on S3)
Is it possible to configure the ELB to stop passing NEW users to the server that is about to shut down, while still passing previous users (that have the sticky session set)? And is it possible to tell the server to wait for, let's say, 10 minutes after the shutdown rule is applied before it actually shuts down?
Thank you very much
This feature wasn't available in Elastic Load Balancing at the time of your question; however, AWS has meanwhile addressed the main part of your question by adding ELB Connection Draining to avoid breaking open network connections while taking an instance out of service, updating its software, or replacing it with a fresh instance that contains updated software.
Please note that you still need to specify a sufficiently large timeout based on the maximum time you expect users to need to finish their activity, see Connection Draining:
When you enable connection draining for your load balancer, you can set a maximum time for the load balancer to continue serving in-flight requests to the deregistering instance before the load balancer closes the connection. The load balancer forcibly closes connections to the deregistering instance when the maximum time limit is reached.
[...]
If your instances are part of an Auto Scaling group and if connection draining is enabled for your load balancer, Auto Scaling will wait for the in-flight requests to complete or for the maximum timeout to expire, whichever comes first, before terminating instances due to a scaling event or health check replacement. [...] [emphasis mine]
The emphasized part confirms that it is not possible to specify an additional timeout that only applies after the last connection has been drained.