AWS ElastiCache CPU usage exceeding 100%

We have been using AWS ElastiCache for our applications. We had initially set a CPU alarm threshold of 22.5% (4-core node, so effectively 90% of a single core), based on the recommended thresholds. But we often see the CPU utilization crossing well past 25%, to values like 28% or 34%.
What I am trying to understand is how this is theoretically possible, considering Redis is single-threaded. The only way I can think this can happen is if a maintenance operation is running on the other cores, which could bump the CPU usage above 25%. Even if the cluster is highly loaded, it should cap CPU usage at 25% and probably start timing out for clients. Can someone help me understand under what scenarios the CPU usage of a single-threaded Redis instance can cross 100% of a single core?

The Redis event loop is single-threaded, but the Redis process itself is not. There are a couple of extra threads to offload some I/O-bound operations. Now, these threads should not consume much CPU.
However, Redis also forks child processes to take care of heavy-duty operations like AOF rewrites or RDB saves. Each forked process generally consumes 100% of a CPU core (unless the operation is slowed down by I/O), on top of the Redis event loop's consumption.
If you find the CPU consumption regularly high, it may be due to a wrong AOF/RDB configuration (i.e. the Redis instance rewrites the AOF or generates a dump too frequently).
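A quick way to check whether background saves or AOF rewrites coincide with the CPU spikes is to look at the persistence section of INFO. Below is a minimal sketch using the redis-py client; the endpoint is a placeholder, and note that ElastiCache blocks the CONFIG command, so persistence settings are tuned through the parameter group rather than CONFIG SET:

    import redis  # pip install redis

    # Hypothetical ElastiCache endpoint; replace with your node's primary endpoint.
    r = redis.Redis(host="my-cluster.xxxxxx.0001.use1.cache.amazonaws.com", port=6379)

    info = r.info("persistence")

    # Fields worth correlating with the CPU alarm times:
    print("RDB bgsave in progress:  ", info["rdb_bgsave_in_progress"])
    print("Last RDB bgsave status:  ", info["rdb_last_bgsave_status"])
    print("Last RDB bgsave duration:", info["rdb_last_bgsave_time_sec"], "s")
    print("AOF rewrite in progress: ", info["aof_rewrite_in_progress"])
    print("Last AOF rewrite duration:", info["aof_last_rewrite_time_sec"], "s")

    # Frequent or long-running bgsaves/rewrites at the times the alarm fires
    # would explain the forked child pushing total usage past 25% of the node.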

Related

What will happen if my virtual machine is too slow

I have a newbie question here: I'm new to cloud and Linux, I'm using Google Cloud now, and I'm wondering about choosing a machine config.
What if my machine is too slow? Will it make the app crash, or just slow it down?
How fast should my VM be? The image below shows the last 6 hours of CPU usage for a Python script I'm running. It's obviously running at less than 2% of the CPU for most of the time, but there's a small spike. Should I care about the spike? And also, how high should my CPU usage get before I upgrade? If a script I'm running is using 50-60% of the CPU most of the time, I assume I'm safe, or what's the max before you upgrade?
What if my machine is too slow? Will it make the app crash, or just slow it down?
It depends.
Some applications will just respond slower. Some will fail if they have timeout restrictions. Some applications will begin to thrash, which means that all of a sudden the app becomes very, very slow.
A general rule, which varies among architects, is to never consume more than 80% of any resource. I use a 50% rule so that my service can handle burst traffic or denial-of-service attempts.
Based on your graph, your service is fine. The spike is probably normal system processing. If the spike went to 100%, I would be concerned.
Once your service consumes more than 50% of a resource (CPU, memory, disk I/O, etc) then it is time to upgrade that resource.
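As a rough illustration of that rule of thumb, here is a small sketch using the third-party psutil package; the 50% threshold is just the figure from above, not a hard limit:

    import psutil  # pip install psutil

    THRESHOLD = 50.0  # percent, per the rule of thumb above

    cpu = psutil.cpu_percent(interval=1)      # machine-wide CPU over 1 second
    mem = psutil.virtual_memory().percent     # RAM in use
    disk = psutil.disk_usage("/").percent     # root filesystem usage

    for name, value in (("CPU", cpu), ("memory", mem), ("disk", disk)):
        verdict = "consider upgrading" if value > THRESHOLD else "ok"
        print(f"{name:6s} {value:5.1f}%  {verdict}")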
Also, consider that there are other services that you might want to add. Examples are load balancers, Cloud Storage, CDNs, firewalls such as Cloud Armor, etc. Those types of services tend to offload requirements from your service and make your service more resilient, available and performant. The biggest plus is your service is usually faster for the end user. Some of those services are so cheap, that I almost always deploy them.
You should choose machine family based on your needs. Check the link below for details and recommendations.
https://cloud.google.com/compute/docs/machine-types
If CPU is your concern, you should create a managed instance group that automatically scales based on CPU usage. Usually 80-85% is a good maximum CPU target. Check the link below for details.
https://cloud.google.com/compute/docs/autoscaler/scaling-cpu
You should also consider the availability needed for your workload to keep costs efficient. See the link below for other useful info.
https://cloud.google.com/compute/docs/choose-compute-deployment-option

Why does the PoCo HTTP server consume CPU when completely idle

I've experimented with the PoCo HTTP server and found it consumes some CPU even when completely idle. This is not high usage, but if we have a lot of instances running it may become a problem.
For network services that use poll, it's normal to permanently use a small amount of CPU time. Nginx and Redis also have some CPU consumption when idle. To achieve zero CPU usage when idle, you will have to use another approach to network communication.
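To see where that idle CPU comes from, here is a minimal Python sketch of the same pattern (Poco's reactor is C++, but the mechanism is the same): an event loop that polls with a finite timeout wakes up periodically even with no traffic, and each wakeup costs a little CPU.

    import selectors
    import socket
    import time

    sel = selectors.DefaultSelector()
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ)

    # Wake up every 100 ms to check timers, shutdown flags, etc. Each wakeup
    # costs a few microseconds of CPU, which is what shows up as "idle" usage.
    deadline = time.time() + 2.0
    wakeups = 0
    while time.time() < deadline:
        events = sel.select(timeout=0.1)  # returns [] on timeout
        wakeups += 1
    print(f"idle wakeups in 2 seconds: {wakeups}")

    # sel.select(timeout=None) would block until a socket is ready and use
    # essentially zero CPU while idle, but then the loop needs some other
    # mechanism (e.g. a self-pipe) to run periodic housekeeping.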

AWS EC2 Performance explanation

I have a REST API web server, built in .NET Core, that has data-heavy APIs.
This is hosted on AWS EC2. I have noticed that the average response time for certain APIs is ~4 seconds, and if I turn up the AWS EC2 specs, the response time goes down to a few milliseconds. I guess this is expected. What I don't understand is that even when I load test the APIs on a lower-end instance, the server never crosses 50% utilization of memory/CPU. So what is the correct technical explanation for the APIs performing faster, given that the lower-end instance never reaches 100% memory/CPU utilization?
There is no simple answer; there are so many EC2 variations that you first need to figure out what is slowing down your API.
When you 'turn up' your EC2 instance, you are getting some combination of more memory, faster CPU, faster disk and more network bandwidth, and we can't tell which of those 'more' features is improving your performance. Different instance classes are optimized for different problems.
It could be as simple as the better network bandwidth, or it could be that your application is disk-bound and the better instance you chose is optimized for I/O performance.
Knowing which resource your instance is lacking would help you decide which type of instance to upgrade to. Or, as you have found out, you can just upgrade to something 'bigger' and be happy with the performance (at the trade-off of it being more expensive).
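One practical way to narrow it down is to measure latency as client concurrency rises; if latency climbs while the instance's CPU stays low, the bottleneck is more likely disk, network or a downstream dependency (such as the database) than the CPU. A rough sketch, with a placeholder URL:

    import statistics
    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests  # pip install requests

    URL = "http://your-ec2-host/api/heavy-report"  # hypothetical endpoint

    def timed_call(_):
        start = time.perf_counter()
        requests.get(URL, timeout=30)
        return time.perf_counter() - start

    for concurrency in (1, 4, 16, 64):
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            latencies = list(pool.map(timed_call, range(concurrency * 5)))
        print(f"concurrency={concurrency:3d}  "
              f"p50={statistics.median(latencies):.3f}s  "
              f"max={max(latencies):.3f}s")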

Celery using only 20% of CPU (at peak)

I'm running a Celery + RabbitMQ app. I start up a bunch of EC2 machines, but I find that my Celery worker machines only use about 15% CPU (with a peak of 20%). I've configured 2 Celery workers per machine.
Shouldn't celery workers be close to using 100% CPU utilization?
MORE INFO: I am not using the Celery --concurrency option or eventlet, even though I am using multiple workers. By default, concurrency is set to 8. My tasks run in PHP and are mostly I/O blocking, so there won't be an issue if we have more processes running in parallel. Is there any way to configure Celery to run a larger number of tasks based on the CPU usage?
Shouldn't celery workers be close to using 100% CPU utilization?
Only if you load them up to utilize 100% CPU :)
My tasks run in PHP and are mostly I/O blocking
If your tasks are primarily making I/O calls, then this is most likely the reason why CPU isn't high, i.e. each process/thread mostly sits idle after making an I/O call, waiting for it to complete.
It's crucial to benchmark your configuration. In practice this could look like the following (a small sketch follows the list):
Choose an initial concurrency level (i.e. the default)
Benchmark throughput / resource usage
Increase the concurrency level
Benchmark throughput / resource usage again
Continue until increasing concurrency no longer provides any benefit
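As an illustration of why more concurrency keeps helping I/O-bound work, here is a small self-contained sketch; it simulates an I/O-bound task with sleep rather than a real Celery worker:

    import time
    from concurrent.futures import ThreadPoolExecutor

    def io_bound_task(_):
        # Stand-in for an I/O-bound task (e.g. an HTTP call to the PHP app);
        # the worker mostly waits, so it burns almost no CPU.
        time.sleep(0.2)

    N_TASKS = 64
    for concurrency in (1, 4, 16, 64):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            list(pool.map(io_bound_task, range(N_TASKS)))
        elapsed = time.perf_counter() - start
        print(f"concurrency={concurrency:3d}  {N_TASKS} tasks in {elapsed:.2f}s")

    # Throughput keeps improving as concurrency rises because the tasks spend
    # their time waiting, not computing; the machine's CPU stays mostly idle.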
If your worker tasks are I/O bound, this is a perfect case for eventlet, since it will allow you to run many, many I/O-bound tasks on a single processor. Consider the case where your machine has 64 cores: you should easily be able to run some multiple of that number of I/O-bound tasks, but at some point the majority of resources will go to process accounting, overhead and context switching.
With eventlet, a single processor could handle hundreds or thousands of concurrent workers:
The prefork pool can take use of multiple processes, but how many is often limited to a few processes per CPU. With Eventlet you can efficiently spawn hundreds, or thousands, of green threads. In an informal test with a feed hub system the Eventlet pool could fetch and process hundreds of feeds every second, while the prefork pool spent 14 seconds processing 100 feeds. Note that this is one of the applications async I/O is especially good at (asynchronous HTTP requests). You may want a mix of both Eventlet and prefork workers, and route tasks according to compatibility or what works best.
You have two options: increase the concurrency level (using --concurrency), or use the (deprecated) auto-scaling option. Most of the time on AWS we deliberately over-subscribe the CPU by using a concurrency setting of 2 * N, where N is the number of vCPUs on the instance type of your choice. We do not over-subscribe the nodes that are subscribed to the special queue where we send our CPU-bound tasks.
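A minimal sketch of the concurrency option, assuming a RabbitMQ broker; the app name, broker URL and task are placeholders:

    import os

    from celery import Celery

    app = Celery("tasks", broker="amqp://guest@localhost//")  # hypothetical broker

    # Over-subscribe the CPU for I/O-bound work: 2 worker processes per vCPU.
    app.conf.worker_concurrency = 2 * (os.cpu_count() or 1)

    @app.task
    def fetch_report(url):
        # Mostly I/O-bound work, e.g. calling the PHP backend over HTTP.
        ...

For hundreds of concurrent I/O-bound tasks, you would instead start the worker with the eventlet pool, e.g. celery -A tasks worker -P eventlet -c 500.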

AWS RDS at 100% CPU with 2 vCores

I currently use a t2.micro RDS instance with SQL Express.
Due to a heavy-load application running, there are times when a single visitor's request might take 30 seconds to complete. This pushes RDS to 100% CPU. The result is that for any other visitor who hits the website at the same time, during the 100% CPU load, the website takes much longer to answer.
T2.micro has 1 vCPU.
I'm thinking of upgrading to t2.medium, which has 2 vCPUs.
The question is: if I have 2 vCPUs, will I avoid the bottleneck?
For example, if the 1st visitor's 30-second request uses vCPU #1 and a second visitor comes at the same time, will he be using vCPU #2? Will that help my situation?
Also, I did not see any option in AWS RDS to see what CPU it is. Is there an option to choose a faster vCPU somehow?
Thank you.
The operating system's scheduler automatically handles the distribution of running threads across all the available cores, to get as much work done as possible in the least amount of time.
So, yes, a multi-core machine should improve performance as long as more than one query is running. If a single, CPU-intensive, long-running query, and nothing else, is running on a 2-core machine, the maximum CPU utilization you'd probably see would be about 50%... but as long as there is more than one query running, each of them will be running on one of the cores at a time, and the system can actually move a thread among the available cores as the workload shifts, to put them on the optimum core.
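To make that 50% figure concrete, here is a small sketch using the multiprocessing module and the third-party psutil package; it runs one CPU-bound worker, then two, and reports machine-wide CPU utilization:

    import multiprocessing as mp
    import time

    import psutil  # pip install psutil

    def burn(seconds):
        # Single-threaded busy loop: stands in for one long, CPU-bound query.
        end = time.time() + seconds
        while time.time() < end:
            pass

    def measure(n_workers, seconds=3):
        procs = [mp.Process(target=burn, args=(seconds,)) for _ in range(n_workers)]
        for p in procs:
            p.start()
        psutil.cpu_percent(interval=None)             # reset the counter
        usage = psutil.cpu_percent(interval=seconds)  # machine-wide percentage
        for p in procs:
            p.join()
        return usage

    if __name__ == "__main__":
        cores = psutil.cpu_count()
        for n in (1, 2):
            print(f"{n} busy worker(s) on {cores} cores -> ~{measure(n):.0f}% total CPU")
        # On a 2-core machine, one worker tops out around 50% overall; a second
        # worker is needed before the machine can approach 100%.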
A t2.micro is a very small server, but t2 is a good value proposition. With all the t2-class machines, you aren't allowed to run 100% CPU continuously, regardless of the number of cores, unless you have a sufficient CPU credit balance available. This is why the t2 is so inexpensive. You need to keep an eye on this metric as well. CPU credits are earned automatically over time, and spent by using CPU. A second motivation for upscaling a t2 machine is that larger t2 instances earn these credits at a faster rate than smaller ones.
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-instances.html
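One way to keep an eye on that metric is to pull CPUCreditBalance from CloudWatch. Here is a sketch using boto3; the region and DB instance identifier are placeholders, and it assumes AWS credentials are already configured:

    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    DB_INSTANCE_ID = "my-t2-micro-db"  # hypothetical instance identifier

    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="CPUCreditBalance",  # reported for burstable (t2/t3) classes
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": DB_INSTANCE_ID}],
        StartTime=datetime.now(timezone.utc) - timedelta(hours=6),
        EndTime=datetime.now(timezone.utc),
        Period=300,
        Statistics=["Average"],
    )

    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], round(point["Average"], 1))

    # A balance trending toward zero means the instance will soon be throttled
    # to its baseline CPU rate, no matter how many vCPUs it has.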