AWS ServerlessV2 CPU usage is always high - amazon-web-services

I've been running a test stack: WordPress on ECS with a ServerlessV2 database with MinCapacity=0.5 and MaxCapacity=2 for a few weeks. This is a test site with nobody accessing it except me.
The load balancer's health check points to a dumb PHP script that only echoes "pong" without ever connecting to the database.
To my surprise, the Aurora ServerlessV2 server was almost always between 1.5 and 2 ACUs. I changed the max to 1, and it's almost always at 1.
I killed the web servers. The DB instance has 0 connections and no queries, but it's still at 1 ACU most of the time, and the CPU averages a steady 55%-65%.
All CloudWatch log exports are disabled; enhanced monitoring is disabled; performance insights are disabled. I'm running the latest 3.02.2 version.
Yet, my totally idle Aurora ServerlessV2 has 55%+ CPU with no connections and no queries and requires 1 ACU to do nothing because 0.5 ACU isn't enough.
I tested the same with a single t3.medium instance, and the CPU hovers between 2% and 8%.
Does anyone have an idea why ServerlessV2 needs so much CPU and ACU to do absolutely nothing?
I would expect ServerlessV2 to use less than 20% CPU and sit steadily at 0.5 ACU when idle with no extras (enhanced monitoring, performance insights) enabled.
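One thing that can explain part of the gap: CPU utilization is reported relative to the capacity currently allocated, so a fixed amount of engine housekeeping occupies a far larger fraction of a small ACU slice than of a t3.medium's 2 vCPUs. A rough sketch with hypothetical numbers (AWS does not publish a vCPU-equivalent per ACU; the 0.25 below is purely an assumption for illustration):

```python
# Hypothetical illustration: fixed background work vs. allocated capacity.
# The vCPU-equivalent per ACU is NOT published by AWS; 0.25 is an assumption.
def cpu_utilization_pct(background_vcpu: float, allocated_vcpu: float) -> float:
    """Background CPU demand as a percentage of the allocated capacity."""
    return 100.0 * background_vcpu / allocated_vcpu

# The same ~0.15 vCPU of engine housekeeping, two different denominators:
t3_medium = cpu_utilization_pct(0.15, 2.0)    # 2 vCPUs
half_acu = cpu_utilization_pct(0.15, 0.25)    # hypothetical 0.5 ACU slice

print(f"t3.medium: {t3_medium:.1f}%")  # ~7.5%
print(f"0.5 ACU:   {half_acu:.1f}%")   # ~60%
```

Under that (assumed) ratio, the same idle workload that reads as single-digit CPU on a t3.medium reads as 55%+ on a small serverless slice, which matches the numbers in the question.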

Related

How can I optimise requests/second under peak load for Django, UWSGI and Kubernetes

We have an application that experiences some pretty short, sharp spikes - generally about 15-20 mins long with a peak of 150-250 requests/second, but roughly an average of 50-100 requests/second over that time. p50 response times are around 70 ms (whereas p90 is around 450 ms).
The application is generally just serving models from a database/memcached cluster, but also sometimes makes requests to 3rd party APIs etc (tracking/Stripe etc).
This is a Django application running with uwsgi, running on Kubernetes.
I'll spare you the full uwsgi/kube settings, but the TLDR:
# uwsgi
master = true
listen = 128 # Limited by Kubernetes
workers = 2 # Limited by CPU cores (2)
threads = 1
# Of course much more detail here that I can load test...but will leave it there to keep the question simple
# kube
Pods: 5-7 (horizontal autoscaling)
If we assume a 150 ms average response time, I'd roughly calculate a total capacity of 93 requests/second - somewhat short of our peak. In our logs we often get "uWSGI listen queue of socket ... full" messages, which makes sense.
My question is...what are our options here to handle this spike? Limitations:
It seems the 128 listen queue is determined by the kernel, and the kube docs suggest it's unsafe to increase this.
Our Kube nodes have 2 cores. The general advice seems to be to set your number of workers to 2 * cores (possibly + 1), so we're pretty much at our limit here. Increasing to 3 doesn't seem to have much impact.
Multiple threads in Django can apparently cause weird bugs
Is our only option to keep scaling horizontally at the Kubernetes level - aside from making our queries/caching as efficient as possible, of course?
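The back-of-envelope capacity math in the question can be written out explicitly; the numbers below are the ones given (7 pods at the autoscaling ceiling, 2 synchronous workers each, 1 thread, ~150 ms mean response):

```python
def max_throughput_rps(pods: int, workers_per_pod: int, threads: int,
                       mean_response_s: float) -> float:
    """Upper bound for synchronous workers: concurrency / service time."""
    concurrency = pods * workers_per_pod * threads
    return concurrency / mean_response_s

# 7 pods * 2 workers * 1 thread, 150 ms average response time
print(round(max_throughput_rps(7, 2, 1, 0.150), 1))  # 93.3 rps
```

The bound scales linearly in each factor, so the levers are exactly the ones listed as limited: more pods, more workers per pod, more threads, or a lower mean response time.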

AWS Elasticache backed by memcached CPU usage flat line at 1%

I've created an ElastiCache cluster in AWS, with node type as t3.micro (500 MB, 2 vCPUs and network up to 5 gigabit). My current setup is having 3 nodes for High Availability, each node is in a different AZ.
I'm using the AWS labs memcached client for Java (https://github.com/awslabs/aws-elasticache-cluster-client-memcached-for-java) that allows auto discovery of nodes, i.e. I only need to provide the cluster DNS record and the client will automatically discover all nodes within that cluster.
I intermittently get some timeout errors:
1) Error in custom provider, net.spy.memcached.OperationTimeoutException: Timeout waiting for value: waited 2,500 ms. Node status: Connection Status { /XXX.XX.XX.XXX:11211 active: false, authed: true, last read: 44,772 ms ago /XXX.XX.XX.XXX:11211 active: true, authed: true, last read: 4 ms ago /XXX.XX.XX.XXX:11211 active: true, authed: true, last read: 6 ms ago
I'm trying to understand what's the problem, but nothing really stands out by looking at the CloudWatch metrics.
The only thing that looks a bit weird is the CPU utilization graph:
The CPU always maxes out at 1% during peak hours, so I'm trying to understand how to read this value and whether it's really not 1% but more like 100%, indicating that there's a bottleneck on the CPU.
Any help on this?
Just one question: why are you using such small instances? How is the memory usage? My guess is the same as yours - the CPU is causing the trouble, and 3 micro instances are not much.
I would try increasing the instance size, but it is just a guess.
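For what it's worth, on how to read the metric: CloudWatch's CPUUtilization for a node is averaged across all of its vCPUs (2 on a t3.micro), so the most a 1% reading could hide is ~2% on the busiest core; a single fully saturated core would instead report ~50%. A small sketch of that conversion:

```python
def busiest_core_upper_bound(reported_pct: float, vcpus: int) -> float:
    """If all reported CPU time ran on one core, that core's utilization."""
    return min(100.0, reported_pct * vcpus)

# 1% average across 2 vCPUs => at most ~2% on the busiest core
print(busiest_core_upper_bound(1.0, 2))   # 2.0
# For contrast: one saturated core on 2 vCPUs reports ~50% average
print(busiest_core_upper_bound(50.0, 2))  # 100.0
```

That suggests a flat 1% reading really is low, so the intermittent timeouts may be worth investigating on the network or client side as well.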

How to improve web-service api throughput?

I'm new to creating web services, so I'd like to know what I'm missing on performance (assuming I'm missing something).
I've built a simple Flask app. Nothing fancy; it just reads from the DB and responds with the result.
uWSGI is used for WSGI layer. I've run multiple tests and set process=2 and threads=5 based on performance monitoring.
processes = 2
threads = 5
enable-threads = True
AWS ALB is used for the load balancer. uWSGI and the Flask app are dockerized and launched in ECS (3 containers [1 vCPU each]).
For each DB hit, the Flask app takes 1-1.5 sec to get the data. There is no other lag on the app side. I know it can be optimised. But assuming that the request processing takes 1-1.5 sec, can the throughput be increased?
The throughput I'm seeing is ~60 requests per second. I feel it's too low. Is there any way to increase the throughput with the same infra?
Am I missing something here, or is the throughput reasonable given that the DB hit takes 1.5 sec?
Note : It's synchronous.
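Since it's synchronous, every worker blocks for the full DB call, so the theoretical ceiling is concurrency divided by service time. Plugging in the numbers from the question (2 processes x 5 threads x 3 containers):

```python
def synchronous_throughput_rps(processes: int, threads: int,
                               containers: int, service_time_s: float) -> float:
    """Max requests/second when every worker blocks for the full DB call."""
    concurrency = processes * threads * containers
    return concurrency / service_time_s

# 30 concurrent workers, each blocked 1.0-1.5 s on the DB per request:
print(synchronous_throughput_rps(2, 5, 3, 1.0))  # 30.0 rps
print(synchronous_throughput_rps(2, 5, 3, 1.5))  # 20.0 rps
```

If you're really measuring ~60 rps, some responses are likely completing faster than the quoted 1-1.5 s (caching, for instance). Either way, with the same infra the only levers are more workers/threads per container or a shorter per-request DB time; the synchronous design caps throughput at this bound.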

Load Testing SQL Alchemy: "TimeoutError: QueuePool limit of size 3 overflow 0 reached, connection timed out, timeout 30"

I have a SQL-Alchemy based web-application that is running in AWS.
The webapp has several c3.2xlarge EC2 instances (8 CPUs each) behind an ELB which take web requests and then query/write to the shared database.
The database I'm using is an RDS instance of type db.m4.4xlarge.
It is running MariaDB 10.0.17.
My SQL Alchemy settings are as follows:
SQLALCHEMY_POOL_SIZE = 3
SQLALCHEMY_MAX_OVERFLOW = 0
Under heavy load, my application starts throwing the following errors:
TimeoutError: QueuePool limit of size 3 overflow 0 reached, connection timed out, timeout 30
When I increase the SQLALCHEMY_POOL_SIZE from 3 to 20, the error goes away for the same load-test. Here are my questions:
How many total simultaneous connections can my DB handle?
Is it fair to assume that Number of EC2 instances * Number of cores per instance * SQLALCHEMY_POOL_SIZE can go up to, but cannot exceed, the answer to question #1?
Do I need to know any other constraints regarding DB connection pool sizes for a distributed web app like mine?
MySQL can handle virtually any number of "simultaneous" connections, but if more than a few dozen are actively running queries, there may be trouble.
Without knowing what your queries are doing, one cannot say whether 3 is a limit or 300.
I recommend you turn on the slowlog to gather information on which queries are the hogs. A well-tuned web app can easily survive 99% of the time on 3 connections.
The other 1% -- well, there can be spikes. Because of this, 3 is unreasonably low.
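On question #2: the pool belongs to a Python process, not a core, so the worst-case connection count is per process. A sketch of the arithmetic (the instance count and the number of worker processes per instance below are assumptions for illustration; they depend on your deployment and web server config):

```python
def total_possible_connections(instances: int, processes_per_instance: int,
                               pool_size: int, max_overflow: int) -> int:
    """Worst-case simultaneous DB connections opened by the web tier."""
    return instances * processes_per_instance * (pool_size + max_overflow)

# e.g. 4 instances (assumed), 8 worker processes each (assumed),
# SQLALCHEMY_POOL_SIZE = 3, SQLALCHEMY_MAX_OVERFLOW = 0:
print(total_possible_connections(4, 8, 3, 0))   # 96
# After raising the pool size to 20:
print(total_possible_connections(4, 8, 20, 0))  # 640
```

Keep the worst-case total below the database's max_connections setting, with headroom for admin connections; raising the pool from 3 to 20 multiplies the ceiling accordingly.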

high cpu in redis 2.8 (elasticache) cache.r3.large

Looking for some help with ElastiCache.
We're using ElastiCache Redis to run a Resque-based queueing system,
which means it's a mix of sorted sets and lists.
At normal operation, everything is OK and we're seeing good response times & throughput.
The CPU level is around 7-10%, and Get+Set commands are around 120-140K operations. (All metrics are CloudWatch based.)
But when the system experiences a (mild) burst of data, enqueueing several K messages, we see the server become near non-responsive:
the CPU is steady at 100% utilization (the metric says 50%, but Redis uses a single core),
the number of operations drops to ~10K,
and response times slow to a matter of SECONDS per request.
We would expect that even IF the CPU got loaded to such an extent, the throughput level would stay the same; this is what we experience when running Redis locally. Redis can saturate a CPU, but throughput stays high: as it is natively single-threaded, no context switching appears.
AFAIK we do NOT impose any limits or persistence, and there is no replication; we're using the basic config.
The size: cache.r3.large.
We are not using periodic snapshotting.
This seems like a characteristic of a rogue Lua script.
A defect in such a script could cause a big CPU load while degrading the overall throughput.
Are you using one? Try looking in the Redis slow log for it.
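SLOWLOG GET returns entries containing an id, a unix timestamp, a duration in microseconds, and the command with its arguments; a script call shows up as EVAL or EVALSHA. A sketch of filtering for long-running script calls (the sample entries below are made up; with redis-py you would fetch real ones and adapt to its entry format):

```python
def find_slow_scripts(slowlog_entries, threshold_us=10_000):
    """Return (id, duration_us, command) for Lua script calls over a threshold."""
    hits = []
    for entry_id, ts, duration_us, args in slowlog_entries:
        # Each entry: (id, unix_timestamp, duration_us, command_args)
        if args and args[0].upper() in ("EVAL", "EVALSHA") and duration_us >= threshold_us:
            hits.append((entry_id, duration_us, args[0]))
    return hits

# Made-up sample entries mimicking SLOWLOG GET output:
sample = [
    (12, 1700000000, 1_200, ["GET", "jobs:pending"]),
    (13, 1700000005, 850_000, ["EVALSHA", "abc123", "2", "queue:a", "queue:b"]),
    (14, 1700000010, 3_400, ["ZADD", "resque:delayed", "1700000100", "job"]),
]
print(find_slow_scripts(sample))  # [(13, 850000, 'EVALSHA')]
```

A script that shows up here with a runtime in the hundreds of milliseconds would block the single Redis thread for its entire duration, which matches the burst behavior described.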