I have over a hundred servers sending metrics to my statsd-graphite setup. A leaf of the metric subtree looks something like this:
stats.dev.medusa.ip-10-0-30-61.jaguar.v4.outbox.get
stats.dev.medusa.ip-10-0-30-62.jaguar.v4.outbox.get
Most of my crawlers are AWS spot instances, which means that around 20 of them go down and come back up at random, and they are allocated different IP addresses every time. As a result, the same list becomes:
stats.dev.medusa.ip-10-0-30-6.<|subtree
stats.dev.medusa.ip-10-0-30-1.<|subtree
stats.dev.medusa.ip-10-0-30-26.<|subtree
stats.dev.medusa.ip-10-0-30-21.<|subtree
Assuming the metrics under one subtree take up 4 GB in total, 20 spot instances going down and 30 coming up later with different IP addresses means that my storage suddenly grows by 120 GB (30 new subtrees at 4 GB each). Moreover, this happens every week.
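The estimate above is simple multiplication, and a small sketch makes it easy to plug in other numbers. The 4 GB per subtree figure comes from the question; the function name is only illustrative.

```python
# Figure taken from the question: each per-host subtree stores about 4 GB.
GB_PER_SUBTREE = 4


def added_storage_gb(new_hosts, gb_per_subtree=GB_PER_SUBTREE):
    """Storage consumed by the subtrees created for hosts with unseen IPs."""
    return new_hosts * gb_per_subtree


# 30 replacement instances, each with a fresh IP, create 30 new subtrees.
print(added_storage_gb(30))  # 120
```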
While it would be simple to delete the older IP subtrees, I really want to retain the metrics. I can have 3 medusas in week 0, 23 in week 1, 15 in week 2, and 40 in week 4. What are my options? How would you tackle this?
We achieve this by not logging the IP address. Use a deterministic locking scheme: when an instance comes up, it requests a machine ID, and it then uses that machine ID instead of the IP address in the statsd bucket name.
stats.dev.medusa.machine-1.<|subtree
stats.dev.medusa.machine-2.<|subtree
This means you should only ever have up to 40 of these buckets. We are using this concept successfully, with a simple number-allocator API on a separate machine that allocates the instance numbers. Once a machine has an instance number, it stores it as a tag on that machine, so our allocator can query the tags of the EC2 instances to see which numbers are in use at the moment. This allows it to re-allocate old machine IDs.
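A minimal sketch of the allocation logic, assuming a single-process allocator. The class name, method names, and instance IDs are hypothetical. The in-memory dictionary stands in for the EC2 tag lookup described above, and a real service would also need locking around concurrent requests.

```python
class MachineIdAllocator:
    """Hands out the lowest free machine number and reclaims released ones."""

    def __init__(self, max_ids):
        self.max_ids = max_ids
        self.in_use = {}  # machine number -> instance id (stand-in for EC2 tags)

    def allocate(self, instance_id):
        """Give the instance the lowest machine number that is not in use."""
        for number in range(1, self.max_ids + 1):
            if number not in self.in_use:
                self.in_use[number] = instance_id
                return number
        raise RuntimeError("no free machine ids")

    def sync(self, live_instance_ids):
        """Forget numbers held by instances that no longer exist."""
        self.in_use = {
            number: instance
            for number, instance in self.in_use.items()
            if instance in live_instance_ids
        }


def bucket_prefix(number):
    """Statsd bucket prefix that replaces the IP-based one."""
    return f"stats.dev.medusa.machine-{number}"


allocator = MachineIdAllocator(max_ids=40)
first = allocator.allocate("i-aaa")   # gets number 1
second = allocator.allocate("i-bbb")  # gets number 2
allocator.sync({"i-bbb"})             # i-aaa was terminated
reused = allocator.allocate("i-ccc")  # number 1 is handed out again
print(bucket_prefix(reused))          # stats.dev.medusa.machine-1
```

Because released numbers are reused, the metric tree stays bounded by the peak fleet size rather than by the number of IP addresses ever seen.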
Background
I have an AWS-managed Elasticsearch v6.0 cluster that has 14 data instances.
It has time-based indices like data-2010-01, ..., data-2020-01.
Problem
Free storage space is very unbalanced across instances, which I can see in the AWS console:
I have noticed that this distribution changes every time the AWS service runs through a blue-green deploy.
This happens when cluster settings are changed or AWS releases an update.
Sometimes the blue-green results in one of the instances completely running out of space.
When this happens the AWS service starts another blue-green and this resolves the issue without customer impact. (It does have impact on my heart rate though!)
Shard Size
Shards for our indices are gigabytes in size but below the Elasticsearch recommendation of 50GB.
The shard size does vary by index, though. Lots of our older indices have only a handful of documents.
Question
It is unexpected that the AWS balancing algorithm does not balance well and that it produces a different result each time.
My question is: how does the algorithm choose which shards to allocate to which instance, and can I resolve this imbalance myself?
I asked AWS support this question, and they gave me a good answer, so I thought I'd share a summary here for others.
In short:
AWS Elasticsearch distributes shards based on shard count rather than shard size so keep your shard sizes balanced if you can.
If your cluster is configured to be spread across 3 availability zones, make your data instance count a multiple of 3.
My Case
Each of my 14 instances gets ~100 shards instead of ~100 GB each.
Remember that I have a lot of relatively empty indices.
This translates to a mixture of small and large shards which causes the imbalance when AWS Elasticsearch (inadvertently) allocates lots of large shards to an instance.
This is further worsened by the fact that I have my cluster set to be distributed across 3 availability zones and my data instance count (14) is not divisible by 3.
Increasing my data instance count to 15 (or decreasing to 12) solved the problem.
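To illustrate both effects, here is a toy model. It is not the actual AWS allocation algorithm: it assumes plain round-robin placement by shard count and an even spread of nodes over zones, and the shard sizes are made up.

```python
def nodes_per_zone(node_count, zones=3):
    """Spread data nodes over availability zones as evenly as possible."""
    base, extra = divmod(node_count, zones)
    return [base + (1 if z < extra else 0) for z in range(zones)]


def allocate_by_count(shard_sizes_gb, node_count):
    """Round-robin placement: shard counts stay even, sizes are ignored."""
    nodes = [[] for _ in range(node_count)]
    for i, size in enumerate(shard_sizes_gb):
        nodes[i % node_count].append(size)
    return nodes


# Made-up shard sizes in GB: a few large shards among many nearly empty ones.
shards = [40, 1, 1, 1, 40, 1, 1, 1]
placement = allocate_by_count(shards, node_count=4)
print([len(node) for node in placement])  # [2, 2, 2, 2]
print([sum(node) for node in placement])  # [80, 2, 2, 2]

print(nodes_per_zone(14))  # [5, 5, 4]
print(nodes_per_zone(15))  # [5, 5, 5]
```

Every node holds the same number of shards, yet one node holds almost all of the data. With 14 nodes, one zone also has fewer nodes to share its portion of the shards.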
From the AWS Elasticsearch docs on Multi-AZ:
To avoid these kinds of situations, which can strain individual nodes and hurt performance, we recommend that you choose an instance count that is a multiple of three if you plan to have two or more replicas per index.
Further Improvement
On top of the availability zone issue, I suggest keeping index sizes balanced to make it easier for the AWS algorithm.
In my case I can merge older indexes, e.g. data-2019-01 ... data-2019-12 -> data-2019.
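One way to do such a merge is the Elasticsearch `_reindex` API, which copies documents from a source index (or index pattern) into a destination index. The sketch below only builds the request body and sends nothing. It assumes the monthly indices have compatible mappings, and the helper name is hypothetical.

```python
import json


def reindex_body(source_pattern, dest_index):
    """Request body for POST _reindex that copies source_pattern into dest_index."""
    return {
        "source": {"index": source_pattern},
        "dest": {"index": dest_index},
    }


# The wildcard matches data-2019-01 ... data-2019-12 but not data-2019 itself.
body = reindex_body("data-2019-*", "data-2019")
print(json.dumps(body, indent=2))
```

The monthly indices would still need to be deleted separately once the merged index has been verified.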
I have a multi-day analysis problem that I am running on a 72 cpu c5n EC2 instance. To get spot pricing, I made my code interruption-resilient and am launching a spot request of one instance. It works great, but this seems like overkill given that Spot can handle thousands of instances. Is this the correct way to solve my problem or am I using a sledgehammer to squash a fly?
I've tried normal EC2 launching, which works great, except that it is four times the price. I don't know of any other way to approach this except for these two ways. I thought about Fargate or containers or something, but I am running a 72 cpu c5n node, and those other options won't let me use that kind of horsepower (that I know of, hence my question).
Thanks!
Amazon EC2 Spot Instances are an excellent way to get cheaper compute (up to 90% discount). The only downside is that the instances might be stopped/terminated (your choice) if there is insufficient capacity.
Some strategies to improve your chance of obtaining spot instances:
Use instances across different Instance Types and Availability Zones because they each have different availability pools (EC2 Spot Fleet can assist with this)
Use resources on weekends and in evenings (even in different regions!) because these tend to be times of lower usage
Use Spot Instances with a specified duration (also known as Spot blocks), but this is at a higher price and a maximum duration of 6 hours
If your software permits it, you could split your load between multiple instances to get the job done faster and to be more resilient against any stoppages of your Spot instances.
Hopefully your application is taking advantage of all the CPUs; otherwise you'd be better off with smaller instances.
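A minimal sketch of the two ideas, splitting the load and surviving interruptions, assuming the work can be expressed as a list of independent items. The function names are hypothetical, and the local checkpoint file stands in for storage that outlives the instance (for example a persistent volume or an object store).

```python
import json
import os
import tempfile


def split_work(items, workers):
    """Deal items out to workers in round-robin order."""
    return [items[i::workers] for i in range(workers)]


def run_with_checkpoint(items, process, checkpoint_path):
    """Process items in order, recording progress so a restart can resume."""
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]
    for index in range(done, len(items)):
        process(items[index])
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"done": index + 1}, f)
        os.replace(tmp_path, checkpoint_path)  # atomic swap of the checkpoint


# Demo: the second call finds the checkpoint and repeats no work.
checkpoint = os.path.join(tempfile.mkdtemp(), "progress.json")
processed = []
run_with_checkpoint([10, 20, 30], processed.append, checkpoint)
run_with_checkpoint([10, 20, 30], processed.append, checkpoint)
print(processed)                      # [10, 20, 30]
print(split_work(list(range(7)), 3))  # [[0, 3, 6], [1, 4], [2, 5]]
```

If an interruption lands between processing an item and writing the checkpoint, that item is processed again on restart, so `process` should be safe to repeat.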
I'm currently building a chat application based on Node.js,
so I'm trying to work out which instance type is best for our server.
AWS has a lot of choices: General Purpose, Compute Optimized, Memory Optimized, ...
Could you please give me some advice? :(
You can read this - https://aws.amazon.com/blogs/aws/choosing-the-right-ec2-instance-type-for-your-application/
Actually, it doesn't matter which hosting provider you choose: AWS, MS Azure, Google Compute Engine, etc.
If you want to get as much as you can from your servers and infrastructure, you need to start from the task you actually have to solve.
First of all, estimate how many users will be active at the same time over the next 3-6 months.
If there will be fewer than 1000k active users (connections) per second, I think you can start with the smallest instance type. You should check how you can increase the CPU/RAM/HDD (or SSD) of your instance,
so that when you get more users you will have a plan for speeding up your server.
Also keep an eye on your server analytics (CPU/RAM/IO utilization) as you get more and more users.
Another question is whether you need to pass any certifications related to security restrictions.
Since you are not quite sure where to start, I would recommend starting with a General Purpose EC2 instance from the M category (M3 or M4) for production. You can start with a smaller instance type like m3.medium.
Note: If it's an internal chat application with low traffic, you can even consider T-series EC2 instances.
The important part here is not to try to predict your capacity needs. Instead, start small with a General Purpose EC2 instance and, later on, do proper capacity planning based on the instance's observed resource consumption. Since you can scale instances both horizontally and vertically, choosing the scaling unit means trading off instance type against cost and expected load over time.
One approach I follow is this:
1. Start with a General Purpose instance (unless I'm confident that there are special needs such as networking, IO, etc.).
2. Load test the application on a single EC2 instance (without autoscaling), varying the number of users to find its limits (how many users a single EC2 instance can handle).
3. After analyzing the memory, CPU, and IO utilization, consider shifting to a different EC2 category or sticking with the same type. (Say CPU hits its limit but memory is hardly used; you can consider using C-series instances.)
4. Scale the EC2 instance vertically by moving to the next size (e.g. m3.medium to m3.large) and repeat the load tests to find its limits.
5. After repeating steps 3 and 4, you can find an optimal balance between cost and performance.
Let's take 3 instance types, with X as the cost of the smallest one (each step up in EC2 size roughly doubles the cost; the figures below are illustrative):
m3.medium - can serve 100 users, cost X
m3.large - can serve 220 users, cost 2X
m3.xlarge - can serve 300 users, cost 3X
It's an easy choice to select m3.large as the EC2 instance size, since it serves 110 users per X of cost, compared with 100 for the other two.
However, it's not as straightforward for some applications, where you need to decide the instance type based on your average expected load.
6. Set up autoscaling and load balancing to scale the EC2 instances horizontally to handle above-average load.
For more details, refer to the Architecting for the Cloud: Best Practices whitepaper.
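The comparison in the m3 example comes down to users served per unit of cost. A short sketch using the illustrative figures from that example (the variable and function names are hypothetical):

```python
# (users served, cost in multiples of X), taken from the example above.
options = {
    "m3.medium": (100, 1),
    "m3.large": (220, 2),
    "m3.xlarge": (300, 3),
}


def users_per_cost_unit(users, cost):
    """How many users one X of spend serves on this instance size."""
    return users / cost


ratios = {name: users_per_cost_unit(*spec) for name, spec in options.items()}
best = max(ratios, key=ratios.get)
print(ratios)  # {'m3.medium': 100.0, 'm3.large': 110.0, 'm3.xlarge': 100.0}
print(best)    # m3.large
```

With real load-test results, the same ratio can be recomputed for whichever sizes were tested.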
I would recommend starting with a t2.micro Linux instance. Watch the CPU usage in CloudWatch. Once the CPU usage starts to exceed 50% to 75%, or free memory gets low, or disk I/O gets saturated, switch to the next larger instance.
t2.micro Linux instances are (for the most part) free. Read the fine print. t2.micro instances are burstable, which means that you can get good performance from a small instance.
Unless your chat application has a huge customer / transaction base, you (probably) won't need the other instance types.
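The upgrade rule described above can be written as a small check. The 75% CPU threshold is the upper end of the range mentioned in this answer. The 10% free-memory floor and all names are assumptions for illustration, and the inputs would come from whatever monitoring you use.

```python
def should_upsize(avg_cpu_percent, free_memory_fraction, disk_io_saturated,
                  cpu_threshold=75.0, min_free_memory_fraction=0.10):
    """Return True when any of the three pressure signals is present."""
    return (
        avg_cpu_percent > cpu_threshold
        or free_memory_fraction < min_free_memory_fraction
        or disk_io_saturated
    )


print(should_upsize(80.0, 0.50, False))  # True: CPU above threshold
print(should_upsize(30.0, 0.05, False))  # True: memory nearly exhausted
print(should_upsize(30.0, 0.50, False))  # False: no pressure yet
```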
I have created a few AWS EC2 instances. However, my data throughput (for both upload and download) sometimes becomes severely limited on certain servers.
For example, I typically get about 15-17 MB/s of throughput from an instance located in US West (Oregon). However, sometimes, especially when I transfer a large amount of data in a single day, my throughput drops to 1-2 MB/s. When this happens on one server, the other servers still have their typical network throughput (as expected).
How can I avoid it? And what can cause this?
If it is due to the amount of data I upload/download, how can I avoid it?
At the moment, I am using t2.micro type instances.
Simple answer: don't use micro instances.
AWS is a multi-tenant environment, so resources are shared. When it comes to network performance, larger instance sizes get higher priority. Only the largest instances get any sort of dedicated performance.
Micro and nano instances get the lowest priority of all instance types.
This matrix shows what priority each instance size gets:
https://aws.amazon.com/ec2/instance-types/#instance-type-matrix
This is probably too simple a question, but how does one create many instances (in the low hundreds)?
Our process can require up to ten instances per task, and we may need to run up to a dozen instances.
However, the upper limit on instances we can run per account is 10-20.
Up until now, we have worked around this with multiple AWS accounts, which is creaky to say the least. We would prefer something more like a large cluster.
Is there a way of raising the limit programmatically, or does one have to make a special AWS request?
Thanks.
All accounts have a default limit of around 20 instances. This limit can easily be increased through the AWS console.
You can do this by following these steps:
- Go to the Limits page in the AWS Console:
https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#Limits:
- Click Request limit increase for each instance type you want increased
- You'll need to do this for each region
Provided you don't submit a crazy high limit request (1000 instances), your request should be approved automatically within a few minutes.
See the docs for more info on EC2 limits:
http://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html#limits_ec2