I've created an ElastiCache cluster in AWS with the cache.t3.micro node type (500 MB of memory, 2 vCPUs, and network performance of up to 5 Gigabit). My current setup has 3 nodes for high availability, each node in a different AZ.
I'm using the AWS Labs memcached client for Java (https://github.com/awslabs/aws-elasticache-cluster-client-memcached-for-java), which supports auto discovery of nodes, i.e. I only need to provide the cluster DNS record and the client automatically discovers all nodes within the cluster.
I intermittently get some timeout errors:
1) Error in custom provider, net.spy.memcached.OperationTimeoutException: Timeout waiting for value: waited 2,500 ms.
Node status: Connection Status {
/XXX.XX.XX.XXX:11211 active: false, authed: true, last read: 44,772 ms ago
/XXX.XX.XX.XXX:11211 active: true, authed: true, last read: 4 ms ago
/XXX.XX.XX.XXX:11211 active: true, authed: true, last read: 6 ms ago
I'm trying to understand what the problem is, but nothing really stands out when looking at the CloudWatch metrics.
The only thing that looks a bit weird is the CPU utilization graph:
The CPU always maxes out at around 1% during peak hours, so I'm trying to understand how to read this value and whether it isn't really 1% but something closer to 100%, which would indicate a CPU bottleneck.
Any help on this?
Just one question: why are you using such small instances? How is the memory usage? My guess is the same as yours: the CPU is causing the trouble. Three micro instances are not much.
I would try larger instances, but it is just a guess.
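To check whether that flat 1% is hiding short spikes, it can help to pull the per-node CPUUtilization metric at 1-minute granularity instead of relying on the rolled-up console graph. A minimal boto3 sketch, assuming a hypothetical cluster id "my-cluster", node id "0001", and region us-east-1:

# Pull per-node CPUUtilization at 1-minute granularity to see whether the flat
# "1%" in the console graph is hiding short spikes behind a coarse average.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # adjust region

end = datetime.now(timezone.utc)
start = end - timedelta(hours=3)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ElastiCache",
    MetricName="CPUUtilization",
    Dimensions=[
        {"Name": "CacheClusterId", "Value": "my-cluster"},  # hypothetical cluster id
        {"Name": "CacheNodeId", "Value": "0001"},           # hypothetical node id
    ],
    StartTime=start,
    EndTime=end,
    Period=60,  # 1-minute datapoints
    Statistics=["Average", "Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"], point["Maximum"])

If the Maximum series shows brief jumps well above the Average, the timeouts probably line up with short bursts rather than a sustained CPU problem.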
I've been running a test stack for a few weeks: WordPress on ECS with an Aurora ServerlessV2 database, MinCapacity=0.5 and MaxCapacity=2. This is a test site with nobody accessing it except me.
The load balancer's health check points to a dumb PHP script that only echoes "pong" without ever connecting to the database.
To my surprise, the Aurora ServerlessV2 server was almost always between 1.5 and 2 ACUs. I changed the max to 1, and it's almost always at 1.
I killed the web servers. The DB instance has 0 connections, no queries, but still: it's at 1 ACU most of the time, and the CPU averages 55%-65% steadily.
All CloudWatch log exports are disabled; enhanced monitoring is disabled; performance insights are disabled. I'm running the latest 3.02.2 version.
Yet, my totally idle Aurora ServerlessV2 has 55%+ CPU with no connections and no queries and requires 1 ACU to do nothing because 0.5 ACU isn't enough.
I tested the same with a single t3.medium instance, and the CPU hovers between 2% and 8%.
Does anyone have an idea why ServerlessV2 needs so much CPU and ACU to do absolutely nothing?
I would expect ServerlessV2 to use less than 20% CPU and sit steadily at 0.5 ACU when idle with no extras (enhanced monitoring, performance insights) enabled.
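In case it helps anyone comparing notes, here is a minimal boto3 sketch (hypothetical instance identifier "my-aurora-instance" and region us-east-1) that pulls the ServerlessDatabaseCapacity (ACU) and CPUUtilization metrics side by side, to correlate how much capacity the idle instance is holding with its CPU usage:

# Fetch ACU and CPU for the same window so the two series can be compared directly.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # adjust region

end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)

def rds_metric(metric_name):
    # Both metrics live in the AWS/RDS namespace, keyed by the instance identifier.
    return {
        "Namespace": "AWS/RDS",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "DBInstanceIdentifier", "Value": "my-aurora-instance"},  # hypothetical
        ],
    }

resp = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {"Id": "acu", "MetricStat": {"Metric": rds_metric("ServerlessDatabaseCapacity"),
                                     "Period": 300, "Stat": "Average"}},
        {"Id": "cpu", "MetricStat": {"Metric": rds_metric("CPUUtilization"),
                                     "Period": 300, "Stat": "Average"}},
    ],
    StartTime=start,
    EndTime=end,
)

for result in resp["MetricDataResults"]:
    print(result["Id"], list(zip(result["Timestamps"], result["Values"]))[:5])

If the ACU series never drops to the configured minimum even with zero connections, at least the two series show exactly when and how far it climbs.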
I'm seeing some errors on our AWS RDS MySQL server:
General error: 1205 Lock wait timeout exceeded; try restarting transaction
Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction
Looking at the RDS console monitoring tab, it seems read IOPS are being capped (screenshot 1), perhaps indicating that the disk I/O is not keeping up with the requests. The funny thing is that write IOPS do not seem to be capped (screenshot 2). In general very few app server requests fail due to these database errors, but I would like to get this sorted out.
CPU load on the RDS server peaks around 50%. This makes me think the db.t3.small RDS size is sufficient.
The database is tiny, just 20 GB, and was created some years ago, so it's on magnetic storage. I have read that this means there's a limit of about 200 IOPS, which matches the approximately 150 + 50 IOPS peaks seen. I am therefore thinking about moving to General Purpose SSD. For a database this small that only provides 100 IOPS as baseline performance according to the docs, but, also according to the docs, bursts of up to 3,000 IOPS are possible.
Does this sound like a good move, and any other suggestions on what to do?
I have been running with General Purpose SSD for a couple of days now. The MySQL deadlock errors have not been seen since, so in case someone else finds this question: changing from Magnetic to General Purpose SSD in RDS is certainly something to try if you have similar problems.
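For reference, a minimal boto3 sketch of that storage change, assuming a hypothetical instance identifier "my-rds-instance". Note that the storage modification can take a while and may briefly affect performance, so a quiet period is a good time to run it:

# Switch the storage type from Magnetic ("standard") to General Purpose SSD ("gp2").
import boto3

rds = boto3.client("rds", region_name="us-east-1")  # adjust region

resp = rds.modify_db_instance(
    DBInstanceIdentifier="my-rds-instance",  # hypothetical identifier
    StorageType="gp2",                       # General Purpose SSD
    ApplyImmediately=True,                   # apply now instead of at the next maintenance window
)

# Shows the storage change while it is still pending.
print(resp["DBInstance"]["PendingModifiedValues"])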
I have a large web-based application running in AWS with numerous EC2 instances. Occasionally -- about twice or thrice per week -- I receive an alarm notification from my Sensu monitoring system notifying me that one of my instances has hit 100% CPU.
This is the notification:
CheckCPU TOTAL WARNING: total=100.0 user=0.0 nice=0.0 system=0.0 idle=25.0 iowait=100.0 irq=0.0 softirq=0.0 steal=0.0 guest=0.0
Host: my_host_name
Timestamp: 2016-09-28 13:38:57 +0000
Address: XX.XX.XX.XX
Check Name: check-cpu-usage
Command: /etc/sensu/plugins/check-cpu.rb -w 70 -c 90
Status: 1
Occurrences: 1
This seems to be a momentary occurrence and the CPU goes back down to normal levels within seconds. So it seems like something not to get too worried about, but I'm still curious why it is happening. Notice that the CPU time is entirely taken up by iowait (100%).
FYI, Amazon's monitoring system doesn't notice this blip. See the images below showing the CPU & IO levels at 13:38.
Interestingly, AWS tells me that this instance will be retired soon. Might the two be related?
AWS is only displaying data at a 5-minute granularity, and it looks like your CPU check is set to send alarms after a single occurrence. If your CPU check's interval is less than 5 minutes, the AWS console may be rolling the values up into an average that masks the actual CPU spike.
I'd recommend narrowing down the AWS monitoring console to a smaller period to see if you see the spike there.
I would add this as comment, but I have no reputation to do so.
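If it helps, here is a minimal boto3 sketch (hypothetical instance id) that switches the instance to detailed (1-minute) monitoring and then pulls the 1-minute Maximum of CPUUtilization, which is the statistic most likely to expose a short spike:

# Enable detailed monitoring (1-minute datapoints, small extra cost), then query
# CPUUtilization at 60-second granularity and look at the Maximum rather than the Average.
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")               # adjust region
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

instance_id = "i-0123456789abcdef0"  # hypothetical instance id

# One-time switch from basic (5-minute) to detailed (1-minute) monitoring.
ec2.monitor_instances(InstanceIds=[instance_id])

end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
    StartTime=start,
    EndTime=end,
    Period=60,
    Statistics=["Maximum"],  # the maximum is what exposes short spikes
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])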
I have noticed my EC2 instances doing this as well, but for far longer, and after apt-get update + upgrade.
I thought it was an Apache thing, then started using Nginx on a new instance to test, and it did the same thing: I ran apt-get a few hours ago, then came back to find the instance using full CPU, and it had been doing so for hours! Good thing it is just a test machine, but I wonder what it is about Ubuntu/apt-get that might have caused this. From now on I guess I will have to reboot the machine after apt-get, as it seems to be the only way to bring it back to normal.
I have a SQL-Alchemy based web-application that is running in AWS.
The webapp has several c3.2xlarge EC2 instances (8 CPUs each) behind an ELB which take web requests and then query/write to the shared database.
The database I'm using is an RDS instance of type db.m4.4xlarge.
It is running MariaDB 10.0.17
My SQL Alchemy settings are as follows:
SQLALCHEMY_POOL_SIZE = 3
SQLALCHEMY_MAX_OVERFLOW = 0
Under heavy load, my application starts throwing the following errors:
TimeoutError: QueuePool limit of size 3 overflow 0 reached, connection timed out, timeout 30
When I increase the SQLALCHEMY_POOL_SIZE from 3 to 20, the error goes away for the same load-test. Here are my questions:
How many total simultaneous connections can my DB handle?
Is it fair to assume that (number of EC2 instances) * (number of cores per instance) * SQLALCHEMY_POOL_SIZE can go up to, but should not exceed, the answer to question #1?
Do I need to know about any other constraints regarding DB connection pool sizes for a distributed web app like mine?
MySQL can handle virtually any number of "simultaneous" connections. But if more than a few dozen are actively running queries, there may be trouble.
Without knowing what your queries are doing, one cannot say whether 3 is a limit or 300.
I recommend you turn on the slowlog to gather information on which queries are the hogs. A well-tuned web app can easily survive 99% of the time on 3 connections.
The other 1% -- well, there can be spikes. Because of this, 3 is unreasonably low.
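As a rough way to sanity-check the arithmetic in question #2, here is a minimal SQLAlchemy sketch (hypothetical DSN) that reads the server-side ceiling (max_connections) and shows the pool settings that bound each process's share:

# Check the server-side connection ceiling and configure the per-process pool.
from sqlalchemy import create_engine, text

engine = create_engine(
    "mysql+pymysql://user:password@my-rds-endpoint/mydb",  # hypothetical DSN
    pool_size=3,        # steady-state connections held per process
    max_overflow=2,     # extra connections allowed under bursts
    pool_timeout=30,    # seconds to wait for a free connection before TimeoutError
    pool_recycle=3600,  # recycle connections before the server's wait_timeout closes them
)

with engine.connect() as conn:
    row = conn.execute(text("SHOW VARIABLES LIKE 'max_connections'")).fetchone()
    print(row)  # RDS derives the default from the instance class memory

The total to keep below max_connections is instances * processes per instance * (pool_size + max_overflow), since each worker process maintains its own pool.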
I just implemented a few alarms with CloudWatch last week, and I noticed a strange behavior with EC2 small instances every day between 06:30 and 06:45 (UTC).
I implemented one alarm to warn me when an Auto Scaling group has its CPU over 50% for 3 minutes (average sample) and another alarm to warn me when the same group goes back to normal, which I considered to be CPU under 30% for 3 minutes (also average sample). I did this twice: once for zone A and once for zone B.
That looks OK, but something is happening between 06:30 and 06:45 that takes a certain amount of processing for 2 to 5 minutes. The CPU rises, sometimes triggers the "high use" alarm, and always triggers the "returned to normal" alarm afterwards. Our system is currently in the early stages of development, so no users have access to it, and we don't have any processes/backups/etc. scheduled. We barely have Apache+PHP installed and configured, so I guess it can only be something related to the host machines.
Can anybody explain what is going on and how we can solve it, other than increasing the sample time or the percentage in the "return to normal" alarm? Folks on the Amazon forum said the Service Team would have a look once they get a chance, but it's been almost a week with no reply.
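One option that isn't quite the same as raising the threshold or the sample period: CloudWatch alarms can be configured to require M out of N datapoints to breach before firing, which keeps a 2-5 minute blip from flipping the alarm. A minimal boto3 sketch with hypothetical alarm and Auto Scaling group names:

# Require 3 breaching datapoints out of the last 5 before the alarm fires.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # adjust region

cloudwatch.put_metric_alarm(
    AlarmName="asg-zone-a-cpu-high",  # hypothetical alarm name
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": "my-asg-zone-a"}],  # hypothetical
    Statistic="Average",
    Period=60,            # 1-minute datapoints (requires detailed monitoring)
    EvaluationPeriods=5,  # look at the last 5 datapoints...
    DatapointsToAlarm=3,  # ...and alarm only if 3 of them breach
    Threshold=50.0,
    ComparisonOperator="GreaterThanThreshold",
)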