I just implemented a few alarms with CloudWatch last week, and I noticed strange behavior with EC2 small instances every day between 06:30 and 06:45 (UTC).
I implemented one alarm to warn me when an Auto Scaling group has its CPU over 50% for 3 minutes (average sample), and another alarm to warn me when the same Auto Scaling group goes back to normal, which I considered to be CPU under 30% for 3 minutes (also average sample). I did that twice: once for zone A and once for zone B.
Looks OK, but something happens between 06:30 and 06:45 that takes a certain amount of processing for 2 to 5 minutes. The CPU rises, sometimes triggers the "high use" alarm, but always triggers the "returned to normal" alarm. Our system is currently in the early stages of development, so no users have access to it, and we don't have any processes/backups/etc. scheduled. We barely have Apache+PHP installed and configured, so I guess it can only be something related to the host machines.
Can anybody explain what is going on and how we can solve it, besides increasing the sample time or the percentage in the "return to normal" alarm? Folks at the Amazon forum said the Service Team would have a look once they get a chance, but it's been almost a week with no response.
I am trying to set up a function that will run somewhere on a server. It is a simple GET request, and I want to trigger it every second.
I tried Google Cloud Functions and AWS. Neither of them has a straightforward way to run it every second (every 1 minute only).
Could you please suggest a service, or combination of services, that will allow me to do this? (Preferably not costly.)
Here are some options on AWS ...
Launch a t2.nano EC2 instance to run a script that issues the GET request, sleeps for 1 second, and repeats (a minimal sketch of the loop is below). You can't use cron for this, since it doesn't support sub-minute schedules. This costs about 13 cents per day.
If you are going to do this for months/years then reduce the cost by using Reserved Instances.
If you can tolerate periods where the GET requests don't happen then reduce the cost even further by using Spot instances.
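A minimal sketch of such a loop script, in Python, with a placeholder URL you would replace with your own endpoint:

    #!/usr/bin/env python3
    # Minimal loop: issue a GET roughly once per second, forever.
    import time
    import urllib.request

    URL = "https://example.com/your-endpoint"  # placeholder endpoint

    while True:
        try:
            urllib.request.urlopen(URL, timeout=5).read()
        except Exception as exc:
            # Log and keep going so one failed request doesn't kill the loop.
            print("request failed:", exc)
        time.sleep(1)

Run it under systemd or nohup so it survives logouts; note the effective interval is the request time plus one second.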
That said, why do you need to issue a GET request every second? Perhaps there is a better solution here.
You can create an AWS Lambda function that simply loops, issues the GET request every second, and exits after 240 requests (i.e. 4 minutes); a sketch is below. Then create a CloudWatch Events rule that fires every 4 minutes and invokes the Lambda function.
Every 4 minutes because the maximum timeout you can set for a Lambda function is 5 minutes.
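A sketch of what that Lambda function could look like (Python, with a placeholder URL; an illustration, not a drop-in implementation):

    import time
    import urllib.request

    URL = "https://example.com/your-endpoint"  # placeholder endpoint

    def handler(event, context):
        # Fire roughly one GET per second, 240 times (~4 minutes), then exit.
        # Trigger this with a CloudWatch Events schedule of rate(4 minutes)
        # and set the function timeout to the 5-minute maximum for headroom.
        for _ in range(240):
            start = time.time()
            try:
                urllib.request.urlopen(URL, timeout=5).read()
            except Exception as exc:
                print("request failed:", exc)
            # Sleep out the rest of the second to keep the pacing near 1/sec.
            time.sleep(max(0.0, 1.0 - (time.time() - start)))
        return "done"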
This setup will likely incur only some trivial cost:
At 1 event per 4 minutes, it's $1/month for the CloudWatch events generated.
At 1 call per 4 minutes to a minimally configured (128MB) Lambda function, it's 324,000 GB-second worth of execution per month, just within the free tier of 400,000 GB-second.
Since network transfer into AWS is free, the response size of your GET request is irrelevant. And the first 1GB of transfer out to the Internet is free, which should cover all the GET requests themselves.
I have a large web-based application running in AWS with numerous EC2 instances. Occasionally, about two or three times per week, I receive an alarm notification from my Sensu monitoring system telling me that one of my instances has hit 100% CPU.
This is the notification:
CheckCPU TOTAL WARNING: total=100.0 user=0.0 nice=0.0 system=0.0 idle=25.0 iowait=100.0 irq=0.0 softirq=0.0 steal=0.0 guest=0.0
Host: my_host_name
Timestamp: 2016-09-28 13:38:57 +0000
Address: XX.XX.XX.XX
Check Name: check-cpu-usage
Command: /etc/sensu/plugins/check-cpu.rb -w 70 -c 90
Status: 1
Occurrences: 1
This seems to be a momentary occurrence, and the CPU goes back down to normal levels within seconds, so it seems like nothing to get too worried about. But I'm still curious why it is happening. Notice that the CPU is entirely taken up by iowait (100%).
FYI, Amazon's monitoring system doesn't notice this blip. See the images below showing the CPU & IO levels at 13:38.
Interestingly, AWS tells me that this instance will be retired soon. Might the two be related?
AWS is only displaying a 5 minute period, and it looks like your CPU check is set to send alarms after a single occurrence. If your CPU check's interval is less than 5 minutes, the AWS console may be rolling up the average to mask the actual CPU spike.
I'd recommend narrowing down the AWS monitoring console to a smaller period to see if you see the spike there.
I would add this as a comment, but I don't have the reputation to do so.
I have noticed my EC2 instances have been doing this too, but for far longer, and after apt-get update + upgrade.
I thought it was an Apache thing, so I started using Nginx on a new instance to test, and it did the same thing: I ran apt-get a few hours ago, then came back to find the instance using full CPU, for hours! Good thing it is just a test machine, but I wonder what is wrong with Ubuntu/apt-get that might have caused this. From now on, I guess I will have to reboot the machine after apt-get, as it seems to be the only way to get it back to normal.
I cannot get alarms to work reliably for an AWS EC2 instance. I have a g2.xlarge instance running and want it to stop when it is not in use, i.e. when average CPU usage falls below 2%.
When I try 1 or 2 periods of 1 hour below 2%, it usually works, but then when I start the instance up again it immediately stops itself because it is still in an alarm state. I have tried 12 periods of 5 minutes, which allows it to start OK, but now it doesn't stop at all despite being in an alarm state for several hours.
I have tried various options and can't nail down what makes it work and what doesn't. It feels as if sometimes things work and sometimes they don't. Is it buggy, or am I missing something?
Here is a screenshot of my setup which has failed to trigger a shutdown...
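For reference, what I'm trying to configure corresponds roughly to an alarm like the following via the API (a sketch with a hypothetical instance ID and region; I actually set mine up in the console):

    import boto3

    cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')  # assumed region

    # Stop the instance when average CPU stays below 2% for
    # 12 consecutive 5-minute periods (i.e. a full hour of idling).
    cloudwatch.put_metric_alarm(
        AlarmName='stop-idle-g2-instance',
        Namespace='AWS/EC2',
        MetricName='CPUUtilization',
        Dimensions=[{'Name': 'InstanceId', 'Value': 'i-0123456789abcdef0'}],  # hypothetical ID
        Statistic='Average',
        Period=300,
        EvaluationPeriods=12,
        Threshold=2.0,
        ComparisonOperator='LessThanThreshold',
        AlarmActions=['arn:aws:automate:us-east-1:ec2:stop'],  # built-in EC2 stop action
    )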
I have a micro Amazon EC2 instance, and whenever the application hosted on it is put under a large load for a couple of hours, the application slows down and the CPU credits drop to almost zero.
I have turned the auto scaling option on, but it still does not work. Can someone help me figure out how to get around this?
All t2 instances use a burstable model, which is not really intended for sustained heavy usage. The instance, when idling, builds up CPU credits up to a cap. When the CPU is maxed out, the credits are spent. Once you run out, you are capped at a very low rate. The amount of credits you can accumulate and the rate at which you earn them depend on which t2 instance type you are using.
Autoscaling is for horizontal scaling. With it you can launch extra instances based on certain triggers, but you need a load balancer to spread traffic across instances.
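If you want to see how close an instance is to exhausting its burst, you can watch the CPUCreditBalance metric in CloudWatch. A minimal sketch (with a hypothetical instance ID):

    from datetime import datetime, timedelta
    import boto3

    cloudwatch = boto3.client('cloudwatch')

    # Pull the average CPUCreditBalance for the last hour, in 5-minute buckets.
    resp = cloudwatch.get_metric_statistics(
        Namespace='AWS/EC2',
        MetricName='CPUCreditBalance',
        Dimensions=[{'Name': 'InstanceId', 'Value': 'i-0123456789abcdef0'}],  # hypothetical ID
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=['Average'],
    )

    for point in sorted(resp['Datapoints'], key=lambda p: p['Timestamp']):
        print(point['Timestamp'], point['Average'])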
As for how you can tell from CPU utilization that you are using all of your credited CPU, my experience is that you can't. In top or iostat, for example, your CPU% is reported as something quite low, like 30%, while IO is not the bottleneck, and you wonder why it is stuck at such low CPU usage.
But there is a value you might see in top, at the far right end, something like "68% st": that is the "steal" value. It means you only get 32% of that CPU, so your 30% CPU reading is actually about 94% of what you get.
I have also observed that when you add up the CPU% of the processes in the running state (R) in top, you arrive at a number relative to the CPU actually available to you. For example, I had 24 processes running at 8% each on a t2.medium instance with 2 virtual CPUs: that is 192% actually running, i.e. 96% of the available CPU cycles, not the 32% reported by top and iostat as % user.
If I were to build an auto scaling trigger, I would look at what I can get from the /proc filesystem and take the "steal" amount into account.
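As a rough illustration of that idea, here is a small Python sketch that derives the steal percentage from two reads of /proc/stat; thresholds and the wiring into an actual trigger are left out:

    import time

    def read_cpu_times():
        # First line of /proc/stat: "cpu user nice system idle iowait irq softirq steal ..."
        with open('/proc/stat') as f:
            return [int(x) for x in f.readline().split()[1:]]

    def steal_percent(interval=5):
        a = read_cpu_times()
        time.sleep(interval)
        b = read_cpu_times()
        deltas = [after - before for before, after in zip(a, b)]
        total = sum(deltas)
        steal = deltas[7] if len(deltas) > 7 else 0  # 8th field is "steal"
        return 100.0 * steal / total if total else 0.0

    if __name__ == '__main__':
        print('steal: %.1f%%' % steal_percent())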
I am trying to set up scalable background image processing using Elastic Beanstalk.
My setup is the following:
The application server (running on Elastic Beanstalk) receives a file, puts it on S3, and sends a request to process it over SQS.
The worker server (also running on Elastic Beanstalk) polls the SQS queue, takes the request, loads the original image from S3, processes it into 10 different variants, and stores them back on S3 (a rough sketch of this loop follows below).
These upload events happen at a rate of about 1-2 batches per day, 20-40 pictures per batch, at unpredictable times.
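For context, the worker loop is roughly this shape (a sketch only, with hypothetical queue/bucket names, Pillow assumed for the resizing, and fewer variants than the real 10):

    from io import BytesIO

    import boto3
    from PIL import Image  # Pillow, assumed to be the resizing library

    sqs = boto3.client('sqs')
    s3 = boto3.client('s3')

    QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/image-jobs'  # hypothetical
    BUCKET = 'my-image-bucket'                                                 # hypothetical
    SIZES = [(100, 100), (400, 400), (800, 800)]  # stand-ins for the 10 variants

    while True:
        # Long-poll so idle time costs essentially nothing.
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for msg in resp.get('Messages', []):
            key = msg['Body']  # assumes the message body is just the S3 key
            buf = BytesIO()
            s3.download_fileobj(BUCKET, key, buf)
            for w, h in SIZES:
                buf.seek(0)
                img = Image.open(buf).convert('RGB')
                img.thumbnail((w, h))
                out = BytesIO()
                img.save(out, format='JPEG')
                out.seek(0)
                s3.upload_fileobj(out, BUCKET, 'variants/%dx%d/%s' % (w, h, key))
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg['ReceiptHandle'])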
Problem:
I am currently using one micro instance for the worker. Generating one variant of a picture can take anywhere from 3 seconds to 25-30 seconds (it seems the first ones finish in 3, but then the micro instance slows down; I think this is by design, since micro instances only allow short CPU bursts). Anyway, when I upload 30 pictures, the job takes: 30 pics * 10 variants each * 30 seconds = 2.5 hours to process?!
Obviously this is unacceptable. I tried using a "small" instance instead; the performance is consistent there, but it's about 5 seconds per variant, so still 30*10*5 = 25 minutes per batch. Still not really acceptable.
What is the best way to attack this problem that will get the fastest results and be price-efficient at the same time?
Solutions I can think of:
Rely on Beanstalk auto scaling. I've tried that, setting up auto scaling based on CPU utilization. It seems very slow to react and unreliable. I've tried setting the measurement time to 1 minute and the breach duration to 1 minute, with thresholds of 70% to scale up and 30% to scale down, in increments of 1. It takes the system a while to scale up and then a while to scale down; I can probably fine-tune it, but it still feels weird. Ideally I would like to use a faster machine than micro (small, medium?) for these spikes of work, but with Beanstalk that means I need to run at least one all the time, and since the system is idle most of the time, that doesn't make any sense price-wise.
Abandon Beanstalk for the worker, implement my own monitor of the SQS queue running on a micro, and have it fire up a larger machine (or group of larger machines) when there are enough pending messages in the queue, then terminate them the moment the queue is detected to be idle. That seems like a lot of work, unless there is a ready-made solution for it out there. In any case, I lose the Beanstalk benefits of deploying code through git, managing environments, etc.
I don't like either of these two solutions.
Is there any other nice approach I am missing?
Thanks
CPU utilization on a micro instance is probably not the best metric to use for autoscaling in this case.
Length of the SQS queue would probably be the better metric to use, and the one that makes the most natural sense.
Needless to say, if you can budget for a bigger baseline machine, everything will run that much faster.
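If you go the queue-length route, the wiring would look roughly like this (a sketch with hypothetical names; for Beanstalk you would target the Auto Scaling group its environment created, and tune the threshold to your batch sizes):

    import boto3

    autoscaling = boto3.client('autoscaling')
    cloudwatch = boto3.client('cloudwatch')

    # Simple scaling policy: add one worker instance when triggered.
    policy = autoscaling.put_scaling_policy(
        AutoScalingGroupName='my-worker-asg',        # hypothetical group name
        PolicyName='scale-out-on-queue-backlog',
        AdjustmentType='ChangeInCapacity',
        ScalingAdjustment=1,
        Cooldown=120,
    )

    # Alarm on the depth of the SQS queue rather than on CPU.
    cloudwatch.put_metric_alarm(
        AlarmName='image-queue-backlog',
        Namespace='AWS/SQS',
        MetricName='ApproximateNumberOfMessagesVisible',
        Dimensions=[{'Name': 'QueueName', 'Value': 'image-processing-queue'}],  # hypothetical
        Statistic='Average',
        Period=60,
        EvaluationPeriods=1,
        Threshold=5,
        ComparisonOperator='GreaterThanOrEqualToThreshold',
        AlarmActions=[policy['PolicyARN']],
    )

A matching scale-in policy and an alarm on a low queue depth would bring the group back down once the batch is done.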