In GKE, to save costs, I usually scale the node count down to zero. When I scale the nodes back up (or add them) and run the pods, it takes more than 6-7 minutes before the load balancer connects and the URL is up; until then the health checks sit in a waiting state. Is there any way to reduce this time? Thanks
If Cloud Functions is not an option, you might want to look at Cloud Run (which supports containers and scales to zero) or GKE Autopilot (which does not scale to zero, but you can scale down to minimal resources and it will autoscale up and down as needed).
In short, not really. Node spin-up time is not easily controlled; it is essentially the time it takes for the VM to be allocated, turned on, boot the OS, and finish the Kubernetes-related setup (configuration, joining the node pool, etc.), and that takes time. On top of that comes the Pods' start-up time, which depends on the Docker image (size, dependencies, etc.).
Scaling your application down to zero nodes is not really recommended. It is generally better to keep some nodes up (don't you have other apps running on the GKE cluster? Kubernetes clusters are typically recommended to run at least 3 nodes).
Have you considered using Cloud Functions? Is that possible in your case? It is the best option I know of for quick scale-up and scale-to-zero.
And in general you can send some kind of “ping” to the function to keep it “warm” at a relatively low cost.
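For instance, a minimal keep-warm pinger could look like the sketch below; the function URL is a hypothetical placeholder, and you would schedule the script every few minutes (e.g. with Cloud Scheduler or cron):

```python
# A minimal "keep warm" pinger - just a sketch, not production code.
# FUNCTION_URL is a hypothetical placeholder for an HTTP-triggered Cloud Function.
import urllib.request

FUNCTION_URL = "https://REGION-PROJECT.cloudfunctions.net/my-function"  # hypothetical

def ping() -> None:
    # A simple GET is enough to keep an instance warm for an HTTP-triggered function.
    with urllib.request.urlopen(FUNCTION_URL, timeout=10) as resp:
        print(f"keep-warm ping returned HTTP {resp.status}")

if __name__ == "__main__":
    ping()
```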
If none of the options above is possible (I'd say keeping your node pool with at least 3 nodes running is best, since it takes time for the Kubernetes control plane to boot), I suggest starting by reducing your Pods' start-up time - improving the Docker image, reducing its size, etc.
Here are some articles on how to reduce Docker image size
https://phoenixnap.com/kb/docker-image-size
https://www.ardanlabs.com/blog/2020/02/docker-images-part1-reducing-image-size.html
After that, I would experiment with different machine types for the nodes to check which one spins up the fastest - that could be an interesting exercise in any case.
Here is an interesting comparison of VM spin-up times:
https://www.google.com/amp/s/blog.cloud66.com/part-2-comparing-the-speed-of-vm-creation-and-ssh-access-on-aws-digitalocean-linode-vexxhost-google-cloud-rackspace-packet-cloud-a-and-microsoft-azure/amp/
Related
Premise
I'm trying to come up with the right choice of AWS construct for a containerized microservice (a set of microservices, in fact) deployment. The application will have an average load of about 50% through the day, little to nothing during the night, and at very specific times of the day (which are not always predictable) there is a burst of high-volume requests. Also, it's not a super-busy set of microservices (in other words, 2 instances of 1 vCPU and 8 GB RAM will be just fine).
The Fargate compute option seems to be the better fit for this type of setup, except of course that:
When my application has little or no load during the night, I will still be charged for the full 1 vCPU and 8 GB (which, to me, is not truly "pay as you use", since I might be using only 0.05 or 0.25 vCPU - hypothetical numbers).
The only way to get around this is to write some redefinition strategy myself: watch CloudWatch events and recreate the Fargate tasks with less vCPU. However, that carries extra overhead in deployment time (even if I ensure staggered deployments, it still means a lot of work each time there is a material event). Is there a better way to do this, or a more truly out-of-the-box pay-as-you-use arrangement that lets you consume resources continuously across a range, based on what you are actually using at that moment, without having to jump through hoops?
Lastly, the purist in me still cannot reconcile, in theory, the fact that a microservice isn't really a 'task', and using the Fargate compute option doesn't sound intuitively right to me, even if I could think of a microservice as an extreme case of a task that runs permanently. Cost-wise, am I better off using EC2, since some options seem to come out cheaper than Fargate (I'm aware of the additional responsibility of maintaining/patching those EC2 instances)?
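For reference, the "redefinition" workaround described above could be sketched roughly like this with boto3 (all names are hypothetical, and Fargate only accepts specific CPU/memory combinations) - register a smaller task definition and point the service at it:

```python
# A rough sketch of downsizing a Fargate service with boto3 - not a recommendation.
# Cluster, service, family and image names are hypothetical placeholders.
import boto3

ecs = boto3.client("ecs")

# Register a smaller task definition revision (Fargate allows only certain CPU/memory pairs).
td = ecs.register_task_definition(
    family="my-service",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",       # 0.25 vCPU
    memory="1024",   # 1 GB
    containerDefinitions=[
        {"name": "app", "image": "my-image:latest", "essential": True}
    ],
)

# Point the service at the new revision; ECS then does a rolling replacement of tasks.
ecs.update_service(
    cluster="my-cluster",
    service="my-service",
    taskDefinition=td["taskDefinition"]["taskDefinitionArn"],
)
```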
We have a Flask application that is served via gunicorn, using the eventlet worker. We're deploying the application in a Kubernetes pod, with the idea of scaling the number of pods depending on workload.
The recommended setting for the number of workers in gunicorn is 2-4 x $NUM_CPUS. See the docs. I've previously deployed services on dedicated physical hardware where such calculations made sense. On a 4-core machine, having 16 workers sounds OK, and we eventually bumped it to 32 workers.
Does this calculation still apply in a Kubernetes pod using an async worker, particularly as:
There could be multiple pods on a single node.
The same service will be run in multiple pods.
How should I set the number of gunicorn workers?
Set it to -w 1 and let Kubernetes handle the scaling via pods?
Set it to 2-4 x $NUM_CPU of the Kubernetes nodes? On one pod or multiple?
Something else entirely?
Update
We decided to go with the 1st option, which is our current approach: set the number of gunicorn workers to 1, and scale horizontally by increasing the number of pods. Otherwise there would be too many moving parts, plus we wouldn't be leveraging Kubernetes to its full potential.
For better visibility, here is the final solution chosen by the original author of this question (as of 2019):
Set the number of gunicorn workers to 1 (-w 1), and scale horizontally
by increasing the number of pods (using the Kubernetes HPA).
Note that it might not stay applicable in the near future, given the fast growth of workload-related features in the Kubernetes platform; for example, some Kubernetes distributions offer, besides HPA, Vertical Pod Autoscaling (VPA) and Multidimensional Pod Autoscaling (MPA) too, so I propose continuing this thread as a community wiki post.
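As an illustration of that chosen approach, a minimal gunicorn.conf.py might look like the sketch below (the bind address and connection count are assumptions, not values from the question):

```python
# gunicorn.conf.py - a minimal sketch of the "one worker per pod" approach.
# Horizontal scaling is left to Kubernetes (the HPA adds or removes pods).
bind = "0.0.0.0:8000"      # assumed port; match your container and Service definition
workers = 1                # a single worker per pod
worker_class = "eventlet"  # async worker, as used in the question
worker_connections = 1000  # max simultaneous connections per eventlet worker (assumed)
```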
I'm not a developer and this doesn't seem a simple task, but for your consideration, please follow the best practices for better performance by optimizing the Gunicorn config.
In addition, Kubernetes has different mechanisms to scale your deployment, like the HPA based on CPU utilization (see also: How is Python scaling with Gunicorn and Kubernetes?).
You can also use resource requests and limits for Pods and Containers.
As per the Gunicorn documentation:
DO NOT scale the number of workers to the number of clients you expect to have. Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second.
Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
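That formula is usually expressed in a config file like the sketch below; note that inside a pod, multiprocessing.cpu_count() typically reports the node's cores rather than the pod's CPU limit, which is one reason the rule of thumb translates poorly to Kubernetes:

```python
# gunicorn.conf.py - the documentation's (2 x num_cores) + 1 rule of thumb.
# In a container, cpu_count() usually returns the node's core count,
# not the pod's CPU request/limit, so this can heavily over-provision workers.
import multiprocessing

workers = multiprocessing.cpu_count() * 2 + 1
```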
Update:
Depending on your approach you can choose a different solution (Deployment, DaemonSet); all of the above you can achieve in Kubernetes by following Assigning CPU Resources to Containers and Pods.
Using a Deployment with resources (limits, requests) gives you the possibility of running your app as multiple pods on a single node, based on your hardware limits, but depending on your "app load" that may not be a good enough solution.
CPU requests and limits are associated with Containers, but it is useful to think of a Pod as having a CPU request and limit. The CPU request for a Pod is the sum of the CPU requests for all the Containers in the Pod. Likewise, the CPU limit for a Pod is the sum of the CPU limits for all the Containers in the Pod.
Note:
The CPU resource is measured in CPU units. One CPU, in Kubernetes, is equivalent to, for example, 1 GCP Core.
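As a sketch (image, names and resource values are placeholders, not taken from the question), requests and limits can be set on a Deployment, here via the official Kubernetes Python client; the equivalent YAML manifest works just as well:

```python
# A sketch of a Deployment with CPU/memory requests and limits,
# using the official Kubernetes Python client.
# Image, names and resource values are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster

container = client.V1Container(
    name="flask-app",
    image="registry.example.com/flask-app:latest",  # hypothetical image
    resources=client.V1ResourceRequirements(
        requests={"cpu": "250m", "memory": "256Mi"},  # what the scheduler reserves
        limits={"cpu": "500m", "memory": "512Mi"},    # hard cap enforced at runtime
    ),
)

deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="flask-app"),
    spec=client.V1DeploymentSpec(
        replicas=2,
        selector=client.V1LabelSelector(match_labels={"app": "flask-app"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "flask-app"}),
            spec=client.V1PodSpec(containers=[container]),
        ),
    ),
)

client.AppsV1Api().create_namespaced_deployment(namespace="default", body=deployment)
```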
As mentioned in the post, the second approach (scaling your app across multiple nodes) is also a good choice. In this case you can consider using, for example, a StatefulSet or a Deployment; in addition, on GKE, using the "cluster autoscaler" you can achieve a more scalable solution: when you try to create new Pods and there isn't enough capacity to run them inside the cluster, the cluster autoscaler automatically adds additional resources.
On the other hand, you can consider using other solutions like Cerebral, which gives you the possibility to create user-defined policies for increasing or decreasing the size of the node pools inside your cluster.
GKE's cluster autoscaler automatically resizes clusters based on the demands of the workloads you want to run. With autoscaling enabled, GKE automatically adds a new node to your cluster if you've created new Pods that don't have enough capacity to run; conversely, if a node in your cluster is underutilized and its Pods can be run on other nodes, GKE can delete the node.
Please keep in mind that the question is very general and there is no single good answer for this topic. You should consider all the pros and cons based on your requirements, load, activity, capacity, costs ...
Hope this helps.
What would be the better option to choose between a smaller number of large instances and a larger number of small instances where performance is concerned, given that CloudWatch (with load balancing and scaling) will be used if traffic floods the servers?
AWS is all about ELASTICITY
There is no need to provision large instances when they are not needed and burn money.
There can be many situations where the CPU on one instance goes high while the next large instance you created remains under-utilized.
You should have medium to small instances with respect to the tier you require (memory-intensive, CPU, or network) and scale those instances with properly written policies.
As long as the user data and AMI are stable, you can spawn many instances within minutes, making sure you are not spending way too much and are saving every penny.
SCALE HORIZONTALLY WHEN NEEDED
This is heavily dependent on your application.
I agree with Faisal Nizam's intuition of favoring horizontal scaling. However, there are many applications that will not run very well on small instances.
For example, Elastic recommends Elasticsearch cluster nodes with 64 GB of RAM. Similar reasoning applies to many other data-related applications, where it can be beneficial if a single instance can keep large chunks of data in memory.
I would recommend finding the ideal instance size for your application, and scaling horizontally from there.
Each EC2 instance also has some overhead, so you need to find a balance between a few large and costly instances vs. many small instances, each with its own overhead.
(As of today) To vertically scale up/scale down an EC2 server, it needs to be shut down and spun back up - something to keep in mind before deciding to go for it.
Currently I'm building a chat application based on Node.js.
So I'm wondering which is the best instance type for our server?
AWS has a lot of choices: general purpose, compute optimized, memory optimized, ...
Could you please give me advice? :(
You can read this - https://aws.amazon.com/blogs/aws/choosing-the-right-ec2-instance-type-for-your-application/
Actually, it doesn't matter which hosting you choose - AWS, MS Azure, Google Compute Engine, etc.
If you want to get as much as you can from your servers and infrastructure, you need to work from your current task.
First of all, decide how many concurrent active users you will have in the next 3-6 months.
If there will be fewer than 1000k active users (connections) per second, I think you can start from the smallest instance type. You should check how you can increase the CPU/RAM/HDD (or SSD) of your instance.
So when you get more users, you will have a plan for how to speed up your server.
And keep an eye on your server analytics - CPU/RAM/IO utilization - as you get more and more users.
The other question is whether you need to pass any certifications related to security restrictions...
Since you are not quite sure where to start, I would recommend starting with a general purpose EC2 instance for production from the M category (M3 or M4). You can start with a smaller instance type like m3.medium.
Note: If it's an internal chat application with low traffic, you can even consider T-series EC2 instances.
The important part here is not to try to predict the capacity needs up front. Instead, you can start small with a general purpose EC2 instance and, down the line, do proper capacity planning by looking at the instance's resource consumption. Since you can scale instances both horizontally and vertically, you will need to trade off the instance type, also considering cost and time-bound load requirements, before selecting the scaling unit of your EC2 instances.
One of the approaches I follow is as follows:
Start with a general purpose instance (unless I'm confident that there are special needs such as networking, IO, etc.).
Do a load test of the application (without autoscaling, on a single EC2 instance), varying the number of users to find the limits (how many users a single EC2 instance can handle).
After analyzing the memory, CPU and IO utilization, you can consider shifting to a different EC2 category or sticking with the same type. (Say the CPU hits its limit but memory is hardly used; then you could consider C-series instances.)
Scale the EC2 instance vertically by moving to the next size (e.g. m3.medium to m3.large) and repeat the load tests to find its limits.
After repeating steps 3 and 4, you can find an optimal balance between cost and performance.
Let's take 3 instance types, with cost X for the smallest one selected (since increasing the EC2 size by one unit doubles the cost):
m3.medium - can serve 100 users, cost X
m3.large - can serve 220 users, cost 2X
m3.xlarge - can serve 300 users. cost 3X
It's an easy choice to select m3.large as the EC2 instance size, since it serves 110 users per X of cost.
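Spelled out, the comparison is just users served per unit of cost (using the hypothetical numbers above):

```python
# Users served per X of cost, using the hypothetical numbers above.
options = {"m3.medium": (100, 1), "m3.large": (220, 2), "m3.xlarge": (300, 3)}

for name, (users, cost_in_x) in options.items():
    print(f"{name}: {users / cost_in_x:.0f} users per X")

# m3.medium: 100 users per X
# m3.large: 110 users per X
# m3.xlarge: 100 users per X
```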
However, it's not as straightforward for some applications, where you need to decide the instance type based on your average expected load.
Set up autoscaling and load balancing to scale the EC2 instances horizontally and handle the above-average load (see the sketch below).
For more details, refer to the Architecting for the Cloud: Best Practices whitepaper.
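A target-tracking scaling policy on an existing Auto Scaling group is one common way to implement that last step; here is a hedged boto3 sketch (the group name and the 50% CPU target are placeholders):

```python
# A sketch of a target-tracking scaling policy for an existing Auto Scaling group.
# The group name and the 50% CPU target are hypothetical placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="my-asg",
    PolicyName="keep-average-cpu-at-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```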
I would recommend starting with a T2.micro Linux instance. Watch the CPU usage in CloudWatch. Once the CPU usage starts to exceed 50% to 75%, or free memory gets low, or disk I/O gets saturated, switch to the next larger instance.
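If you prefer to check that from a script rather than the CloudWatch console, a small boto3 sketch (the instance ID is a placeholder):

```python
# A sketch: pull the average CPUUtilization for one instance over the last hour.
# The instance ID is a hypothetical placeholder.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,  # 5-minute buckets
    Statistics=["Average"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 1), "%")
```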
T2.micro Linux instances are (for the most part) free. Read the fine print. T2.micro instances are burstable which means that you can get good performance from a small instance.
Unless your chat application has a huge customer / transaction base, you (probably) won't need the other instance types.
Our website is getting slow and we are in need of an upgrade.
We are currently on AWS and have 1 micro EC2 instance that proved effective while our website had less traffic. Now that we get more traffic, our site is getting slower.
We can't seem to settle an argument.
Which would be better:
Adding multiple additional micro/small instances and having them managed either by nginx or by Amazon's cloud services
OR
Upgrading our micro instance into a large/xlarge instance.
Which would be more effective, considering that the tasks to be performed by the server are simple, and that the total amount of RAM and processing power would be similar: 1 big, or many small?
Thanks
Tough to say -
Option #2 is going to be the easiest to do: turn your server off, resize it, turn it back on, and get more capacity just by paying more money. Easy to do, but maybe not the best long-term solution. What will you do when traffic continues to increase (either constantly or at certain times) and there are no more gains to be had simply by picking a bigger box?
Option #1 is going to be more work, but ultimately maybe a better strategy.
First of all, you didn't say if you have a constant need for more throughput, or if the capacity is only needed at certain times of the day/week/month/year. If it's the latter, multiple EC2 instances in auto-scaling groups - set up to respond to increases and decreases in demand by turning additional instances on as needed and turning them off as demand decreases - is a cost-effective option.
In addition, having multiple instances running - preferably in different availability zones - gives you fault tolerance. When the big instance from option #2 goes down, your website is down; if you have many small instances running across 2 or 3 availability zones, you can continue to function if one or more of your instances goes down, and even if an AWS availability zone goes offline (rare, but it happens).
Besides the options above, and without knowing anything about your application, there are other things you can do: move some static assets to S3 and/or use AWS CloudFront (or another CDN) to offload some of the work - this is often a cheap and easy way to get more out of an existing box.
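For the static-assets suggestion, a minimal boto3 sketch of uploading an asset with CDN-friendly cache headers (bucket and file names are placeholders):

```python
# A sketch: upload a static asset to S3 with a long-lived Cache-Control header,
# so a CDN such as CloudFront can serve it without hitting your EC2 instance.
# Bucket and file names are hypothetical placeholders.
import boto3

s3 = boto3.client("s3")

s3.upload_file(
    "static/logo.png",          # local file
    "my-static-assets-bucket",  # bucket (e.g. the CloudFront origin)
    "static/logo.png",          # object key
    ExtraArgs={
        "ContentType": "image/png",
        "CacheControl": "public, max-age=86400",
    },
)
```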