Capacity planning on AWS - amazon-web-services

I would like to understand how to do capacity planning on AWS and which infrastructure components to use. Take the following example.
I need to set up a Node.js-based server that uses Kafka, Redis, and MongoDB. There will be 250 devices connecting to the server and sending data every 10 seconds. Each data packet will be approximately 10 KB. I will be using the 64-bit Ubuntu image.
What I need to estimate:
MongoDB requires at least 3 servers for redundancy. How do I estimate the size of the VM and the EBS volume required, e.g. should it be m4.large, m4.xlarge, or something else? The default EBS volume size is 30 GB.
What should be the size of the VM for running the other application components, which include 3-4 Node.js processes, Kafka, and Redis? E.g. should it be m4.large, m4.xlarge, or something else?
Can I keep just one application server in an Auto Scaling group and add more as the load increases, or should I go with a minimum of 2?
More generally, given the number of devices, the data packet size, and the data frequency, how do we go about estimating which VM and how much storage to consider, and are there any other considerations?

Nobody can answer this question for you. It all depends on your application and usage patterns.
The only way to correctly answer this question is to deploy some infrastructure and simulate standard usage while measuring the performance of the systems (throughput, latency, disk access, memory, CPU load, etc).
Then, modify the infrastructure (add/remove instances, change instance types, etc) and measure again.
You should certainly run a minimal deployment per your requirements (eg instances in separate Availability Zones for High Availability) and you can use Auto Scaling to add extra capacity when required, but simulated testing would also be required to determine the right trigger points at which more capacity should be added. For example, the best indicator might be memory, or CPU, or latency. It all depends on the application and how it behaves under load.
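Measurement is still the only reliable guide, but the figures in the question do allow a rough first estimate of the raw ingest rate, which gives the load test a starting point. The sketch below assumes 1 KB = 1000 bytes and counts only raw payload for a single copy of the data; it ignores MongoDB replication and indexes, compression, and Kafka retention, all of which change the real numbers. The function names are hypothetical.

```python
# Back-of-envelope ingest estimate. Device count, interval, packet size and
# the 30 GB volume come from the question; the decimal unit convention
# (1 KB = 1000 bytes) is an assumption.
DEVICES = 250
INTERVAL_S = 10
PACKET_BYTES = 10 * 1000
SECONDS_PER_DAY = 86_400


def ingest_estimate(devices=DEVICES, interval_s=INTERVAL_S, packet_bytes=PACKET_BYTES):
    """Return (messages per second, bytes per second, GB per day) of raw payload."""
    msgs_per_s = devices / interval_s
    bytes_per_s = msgs_per_s * packet_bytes
    gb_per_day = bytes_per_s * SECONDS_PER_DAY / 1e9
    return msgs_per_s, bytes_per_s, gb_per_day


def days_until_full(volume_gb, gb_per_day):
    """Days of raw payload that fit on a volume, ignoring OS and database overhead."""
    return volume_gb / gb_per_day


if __name__ == "__main__":
    msgs, rate, per_day = ingest_estimate()
    print(f"{msgs:.0f} msg/s, {rate / 1000:.0f} KB/s, {per_day:.1f} GB/day")
    print(f"A 30 GB volume holds about {days_until_full(30, per_day):.1f} days of raw data")
```

Under these assumptions the ingest rate is small (25 messages per second), while storage grows by roughly 21.6 GB per day, so the default 30 GB volume is more likely to be the first constraint than CPU or network.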

Related

Increase vCPUS/RAM if needed

I have created an AWS EC2 instance to run a computation routine. It works for most cases, but every now and then a user needs to run a computation that crashes my program due to lack of RAM.
Is it possible to scale the EC2 instance's RAM and/or vCPUs on demand, or when a certain threshold is reached (say, when 80% of RAM is used)? What I'm trying to avoid is keeping an unnecessarily large instance; I want to scale resources only when needed.
It is not possible to adjust the amount of vCPUs or RAM on an Amazon EC2 instance.
Instead, you must:
Stop the instance
Change the Instance Type
Start the instance
The virtual machine will be provisioned on a different 'host' computer that has the correct resources matched to the Instance Type.
A common approach is to scale the quantity of instances to handle the workload. This is known as horizontal scaling and works well where work can be distributed amongst multiple computers rather than making a single computer 'bigger' (which is vertical scaling).
The only exception to the above is when using burstable performance instances, which are capable of providing high amounts of CPU but only for limited periods. This is great when you have bursty needs (eg hourly processing or spiky workloads) but should not be used when there is a need for consistent high workloads.
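The stop, change type, start sequence above can be scripted. The following is a minimal sketch, assuming an EBS-backed instance and a boto3 EC2 client passed in by the caller; the function name `resize_instance` and the instance ID in the usage note are hypothetical, and error handling is omitted.

```python
def resize_instance(ec2, instance_id, new_type):
    """Stop an instance, change its type, and start it again.

    `ec2` is expected to be a boto3 EC2 client, e.g. boto3.client("ec2").
    The instance is unavailable while this runs.
    """
    # 1. Stop the instance and wait until it has fully stopped.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # 2. Change the instance type (only possible while stopped).
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )

    # 3. Start the instance and wait until it is running again.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

A call might look like `resize_instance(boto3.client("ec2"), "i-0123456789abcdef0", "m4.xlarge")`, where the ID is a placeholder. Because the instance is down during the change, this does not help a computation that is already running out of memory; it only prepares a larger machine for the next run.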

AWS Network out

Our web application has 5 pages (Signin, Dashboard, Map, Devices, Notification)
We have done the load test for this application, and load test script does the following:
Signin and go to Dashboard page
Click Map
Click Devices
Click Notification
We have a basic free plan in AWS.
While performing the load test, we didn't get any errors up to about 100 users. Please see the image below. NetworkIn and CPUUtilization seem normal, but NetworkOut showed 846K.
But when we reached around 114 users, we started getting errors on the Map page (highlighted in red). During that time, only NetworkOut seems to be high. Please see the image below.
We want to know what the optimal value for NetworkOut is. If this number is high, is there any way to reduce it?
Please let me know if you need more information. Thanks in advance for your help.
You are using a t2.micro instance.
This instance type has limitations on CPU that means it is good for bursty workloads, but sustained loads will consume all the available CPU credits. Thus, it might perform poorly under sustained loads over long periods.
The instance also has limited network bandwidth that might impact the throughput of the server. While all Amazon EC2 instances have limited allocations of bandwidth, the t2.micro and t2.nano have particularly low bandwidth allocations. You can see this when copying data to/from the instance and it might be impacting your workloads during testing.
The t2 family, especially at the low-end, is not a good choice for production workloads. It is great for workloads that are sometimes high, but not consistently high. It is also particularly low-cost, but please realise that there are trade-offs for such a low cost.
See:
Amazon EC2 T2 Instances – Amazon Web Services (AWS)
CPU Credits and Baseline Performance for Burstable Performance Instances - Amazon Elastic Compute Cloud
Unlimited Mode for Burstable Performance Instances - Amazon Elastic Compute Cloud
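To make the CPU credit limitation concrete, here is a rough sketch of how long a t2.micro can sustain a given CPU level before its credits run out. It assumes the t2.micro figures I recall from the AWS documentation linked above (6 credits earned per hour, a maximum balance of 144 credits, one vCPU, and one credit equal to one vCPU at 100% for one minute) and standard (not unlimited) mode; check the current documentation before relying on them. The function name is hypothetical.

```python
import math

# Assumed t2.micro figures; verify against the current AWS documentation.
CREDITS_EARNED_PER_HOUR = 6
MAX_CREDIT_BALANCE = 144
VCPUS = 1


def hours_until_credits_exhausted(cpu_percent, balance=MAX_CREDIT_BALANCE):
    """Hours of sustained load at `cpu_percent` before the credit balance hits zero."""
    # One credit is one vCPU-minute at 100%, so a full hour at 100% costs 60 credits.
    spent_per_hour = 60 * VCPUS * cpu_percent / 100
    net_drain = spent_per_hour - CREDITS_EARNED_PER_HOUR
    if net_drain <= 0:
        # At or below the baseline the balance never runs out.
        return math.inf
    return balance / net_drain


if __name__ == "__main__":
    for pct in (10, 50, 100):
        print(pct, hours_until_credits_exhausted(pct))
```

Under these assumptions a full credit balance lasts about 2.7 hours at 100% CPU and 6 hours at 50%, after which the instance is throttled to its 10% baseline. A long load test can therefore hit this limit even when short tests look fine.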
That said, the network throughput showing on the graphs is a result of your application. While the t2 might be limiting the throughput, it is not responsible for the spike on the graph. For that, you will need to investigate the resources being used by the application(s) themselves.
NetworkOut simply refers to the volume of outgoing traffic from the instance. To reduce NetworkOut, you need to reduce the amount of data this instance sends out. So you may need to check which of Click Map, Click Devices, and Click Notification is sending the most traffic out of the instance. It is not necessarily related only to the number of users, but to a combination of the number of users and the application module.
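When reading the graph, it can help to convert the NetworkOut datapoint into an average rate. CloudWatch reports NetworkOut in bytes per period, so the sketch below divides a datapoint by the period length. It assumes the 846K reading is a Sum in bytes over a 5-minute (300 second) period, which cannot be confirmed from the question; the function name is hypothetical.

```python
def average_rate(sum_bytes, period_s=300):
    """Convert a CloudWatch Sum datapoint in bytes into (bytes/s, bits/s)."""
    bytes_per_s = sum_bytes / period_s
    return bytes_per_s, bytes_per_s * 8


if __name__ == "__main__":
    # 846_000 bytes over 300 s is an assumed reading of the graph.
    per_s, bits = average_rate(846_000)
    print(f"{per_s:.0f} B/s, {bits / 1000:.1f} kbit/s")
```

If those assumptions hold, the average outgoing rate is only a few kilobytes per second, which would suggest looking at the application and CPU credits before blaming network bandwidth.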

What AWS EC2 Instance Types suitable for chat application?

I'm currently building a chat application based on Node.js.
I'm trying to work out which instance type would be best for our server.
AWS offers a lot of choices: general purpose, compute optimized, memory optimized, and so on.
Could you please give me some advice?
You can read this - https://aws.amazon.com/blogs/aws/choosing-the-right-ec2-instance-type-for-your-application/
Actually, it doesn't matter which hosting you choose: AWS, MS Azure, Google Compute Engine, etc.
If you want to get as much as you can from your servers and infrastructure, you need to size them for your actual workload.
First of all, decide how many simultaneously active users you expect in the next 3-6 months.
If there will be fewer than 1000k active users (connections) per second, I think you can start with the smallest instance type. You should also check how you can increase the CPU, RAM, and HDD (or SSD) of your instance,
so that when you get more users you already have a plan for speeding up your server.
And keep an eye on your server analytics (CPU, RAM, and I/O utilization) as you get more and more users.
Another question is whether you need to pass any certifications related to security restrictions.
Since you are not quite sure where to start, I would recommend starting with a general-purpose EC2 instance from the M category (M3 or M4) for production. You can start with a smaller instance type such as m3.medium.
Note: If it's an internal chat application with low traffic, you can even consider T-series EC2 instances.
The important part here is not to try to predict capacity needs up front. Instead, start small with a general-purpose EC2 instance and, later on, do proper capacity planning based on the instance's observed resource consumption. Since you can scale instances both horizontally and vertically, choosing the scaling unit involves a trade-off between instance type, cost, and the load you expect over time.
One approach I follow is this:
1. Start with a general-purpose instance (unless I'm confident there are special needs such as networking or I/O).
2. Load test the application on a single EC2 instance (without Auto Scaling), varying the number of users to find its limits (how many users a single EC2 instance can handle).
3. After analyzing memory, CPU, and I/O utilization, consider shifting to a different EC2 category or sticking with the same one. (Say CPU hits its limit but memory is hardly used: you can consider C-series instances.)
4. Scale the EC2 instance vertically by moving to the next size (e.g. m3.medium to m3.large) and repeat the load tests to find its limits.
5. After repeating steps 3 and 4, you can find an optimal balance between cost and performance.
Let's take three instance types, with X as the cost of the smallest one (since moving up one EC2 size doubles the cost):
m3.medium - can serve 100 users, cost X
m3.large - can serve 220 users, cost 2X
m3.xlarge - can serve 300 users, cost 3X
m3.large is the easy choice here, since it serves 110 users per X of cost.
However, it's not as straightforward for some applications, where you need to choose the instance type based on your average expected load.
6. Set up Auto Scaling and load balancing to scale the EC2 instances horizontally to handle above-average load.
For more details, refer to the Architecting for the Cloud: Best Practices whitepaper.
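The m3.medium, m3.large, and m3.xlarge comparison above can be written out as a small calculation. This is only a sketch using the illustrative user counts and relative costs from that example (they are not measured values), and the helper names are hypothetical. The second function shows why the answer can change once an expected load is fixed.

```python
import math

# (users served per instance, relative cost) taken from the example above.
OPTIONS = {
    "m3.medium": (100, 1),
    "m3.large": (220, 2),
    "m3.xlarge": (300, 3),
}


def users_per_cost_unit(options):
    """Users served per unit of cost X for each instance type."""
    return {name: users / cost for name, (users, cost) in options.items()}


def cheapest_fleet(options, expected_users):
    """Return (type, instance count, total relative cost) for the cheapest
    fleet built from a single instance type. Ties go to the first type listed."""
    best = None
    for name, (users, cost) in options.items():
        count = math.ceil(expected_users / users)
        total = count * cost
        if best is None or total < best[2]:
            best = (name, count, total)
    return best


if __name__ == "__main__":
    print(users_per_cost_unit(OPTIONS))
    for load in (220, 250, 440):
        print(load, cheapest_fleet(OPTIONS, load))
```

With these figures m3.large has the best ratio (110 users per X) and is the cheapest choice for 220 or 440 users, but for 250 users two m3.large instances cost 4X while three m3.medium instances or one m3.xlarge cost 3X. That is the sense in which the expected average load matters.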
I would recommend starting with a t2.micro Linux instance. Watch the CPU usage in CloudWatch. Once the CPU usage starts to exceed 50% to 75%, or free memory gets low, or disk I/O gets saturated, switch to the next larger instance.
t2.micro Linux instances are (for the most part) free. Read the fine print. t2.micro instances are burstable, which means that you can get good performance from a small instance.
Unless your chat application has a huge customer / transaction base, you (probably) won't need the other instance types.

Ensuring consistent network throughput from AWS EC2 instance?

I have created a few AWS EC2 instances. However, sometimes my data throughput (for both upload and download) becomes highly limited on certain servers.
For example, I typically get about 15-17 MB/s throughput from an instance located in US West (Oregon). However, sometimes, especially when I transfer a large amount of data in a single day, my throughput drops to 1-2 MB/s. When this happens on one server, the other servers still have their typical network throughput (as expected).
How can I avoid it? And what can cause this?
If it is due to amount of my data upload/download, how can I avoid it?
At the moment, I am using t2.micro type instances.
Simple answer: don't use micro instances.
AWS is a multi-tenant environment, so resources are shared. When it comes to network performance, the larger instance sizes get higher priority. Only the largest instances get any sort of dedicated performance.
Micro and nano instances get the lowest priority of all instance types.
This matrix shows what network performance each instance size gets:
https://aws.amazon.com/ec2/instance-types/#instance-type-matrix

Which aws instance type is optimal to improve spark shuffle performance?

For my spark application I'm trying to determine whether I should be using 10 r3.8xlarge or 40 r3.2xlarge. I'm mostly concerned with shuffle performance of the application.
If I go with r3.8xlarge I will need to configure 4 worker instances per machine to keep the JVM size down. The worker instances will likely contend with each other for network and disk I/O if they are on the same machine. If I go with 40 r3.2xlarge I will be able to allocate a single worker instance per box, allowing each worker instance to have its own dedicated network and disk I/O.
Since shuffle performance is heavily impacted by disk and network throughput, it seems like going with 40 r3.2xlarge would be the better configuration between the two. Is my analysis correct? Are there other tradeoffs that I'm not taking into account? Does spark bypass the network transfer and read straight from local disk if worker instances are on the same machine?
It seems you already have the answer: going with 40 r3.2xlarge would be the better configuration of the two.
I recommend going through the AWS Well-Architected Framework.
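To see why the choice comes down to I/O sharing rather than CPU or memory, the two layouts from the question can be compared per worker. The sketch below assumes the published R3 specs as I recall them (r3.8xlarge: 32 vCPUs and 244 GiB; r3.2xlarge: 8 vCPUs and 61 GiB) and divides each machine evenly among its workers; the `Layout` class is hypothetical and ignores OS and Spark overhead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    instance_type: str
    machines: int
    workers_per_machine: int
    vcpus: int          # per machine (assumed spec)
    memory_gib: float   # per machine (assumed spec)

    def total_workers(self):
        return self.machines * self.workers_per_machine

    def per_worker(self):
        """(vCPUs, GiB of RAM) available to each worker on a machine."""
        return (self.vcpus / self.workers_per_machine,
                self.memory_gib / self.workers_per_machine)


big = Layout("r3.8xlarge", machines=10, workers_per_machine=4, vcpus=32, memory_gib=244)
small = Layout("r3.2xlarge", machines=40, workers_per_machine=1, vcpus=8, memory_gib=61)

if __name__ == "__main__":
    for layout in (big, small):
        print(layout.instance_type, layout.total_workers(), layout.per_worker(),
              f"{layout.workers_per_machine} worker(s) per network interface")
```

Under these assumptions both layouts give 40 workers with 8 vCPUs and 61 GiB each, so the remaining difference is that four workers share one machine's network interface and local disks in the r3.8xlarge layout, while each r3.2xlarge worker has its own.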
General Design Principles
The Well-Architected Framework identifies a set of general design principles to facilitate good design in the cloud:
Stop guessing your capacity needs: Eliminate guessing your infrastructure capacity needs. When you make a capacity decision before you deploy a system, you might end up sitting on expensive idle resources or dealing with the performance implications of limited capacity. With cloud computing, these problems can go away. You can use as much or as little capacity as you need, and scale up and down automatically.
Test systems at production scale: In a traditional, non-cloud environment, it is usually cost-prohibitive to create a duplicate environment solely for testing. Consequently, most test environments are not tested at live levels of production demand. In the cloud, you can create a duplicate environment on demand, complete your testing, and then decommission the resources. Because you only pay for the test environment when it is running, you can simulate your live environment for a fraction of the cost of testing on premises.
refer:
AWS Well-Architected Framework