Cloud Memorystore Redis high CPU utilisation - google-cloud-platform

We are using a Cloud Memorystore Redis instance to add a caching layer to our mission-critical, Internet-facing application. The total number of calls (including get, set and key expiry operations) to the Memorystore instance is around 10-15K per second. CPU utilisation has been consistently around 75-80% and we expect it to go even higher.
Currently, we are using the M4 capacity tier under the Standard service tier.
https://cloud.google.com/memorystore/docs/redis/pricing
We need some clarity on the following points.
How many CPU cores does the M4 capacity tier correspond to?
Is it really alarming to have more than 100% CPU utilisation? Do we expect any noticeable performance issues?
What are the options to tackle the performance issues (if any) caused by higher CPU utilisation (>=100%)? Will switching to the M5 capacity tier address the high CPU consumption and the corresponding issues?
Our application is really CPU intensive and we don't see any way to optimize it further. Looking forward to some helpful references.

Addressing your questions.
1. How many CPU cores do the M4 capacity tier correspond to?
Cloud Memorystore for Redis is a Google-managed service, which means Google does not expose the inner details (resources) of the virtual machine running the Redis service. Still, it is expected that the higher the capacity tier, the more resources (CPU) the virtual machine will have. In your case in particular, adding CPUs will not solve issues around CPU usage, because the Redis server itself is single-threaded.
As you can see from the previous link:
To maximize CPU usage you can start multiple instances of Redis.
If you want to use multiple CPUs you can start thinking of some way to shard earlier.
2. Is it really alarming to have more than 100% CPU utilisation?
Yes, it is alarming to have high CPU utilization because it can result in connection errors or high latency.
CPU utilization is important, but what also matters is whether the Redis instance can sustain your throughput at an acceptable latency. You can check the Redis latency with the command redis-cli --latency while the CPU % is high.
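If you also want to probe latency from application code rather than only from redis-cli, here is a minimal sketch using the redis-py client; the host IP is a placeholder for your Memorystore endpoint and is an assumption, not a value from the question.

```python
# A minimal latency probe using redis-py; the host IP below is a
# placeholder for your Memorystore endpoint, not a value from the question.
import time

import redis

r = redis.Redis(host="10.0.0.3", port=6379)  # hypothetical instance address

samples = []
for _ in range(100):
    start = time.perf_counter()
    r.ping()                                  # one round trip to the server
    samples.append((time.perf_counter() - start) * 1000.0)  # milliseconds

samples.sort()
print(f"min={samples[0]:.2f} ms  p50={samples[49]:.2f} ms  "
      f"p99={samples[98]:.2f} ms  max={samples[-1]:.2f} ms")
```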
3. Do we expect any noticeable performance issues?
This is really hard to say or predict because it depends on several factors (client services, commands run within a time frame, workload). Some of the most common causes of high latency and performance issues are:
Client VMs or services are overloaded and not consuming the messages from Redis: when a client opens a TCP connection to Redis, the Redis server keeps a buffer of messages to send on that connection. If a client service has its CPU maxed out, leaving no time for its kernel to receive messages from Redis, those buffers fill up on the Redis server.
The commands executed are consuming a lot of CPU: the following commands are known to be potentially very expensive to process (a SCAN-based alternative to KEYS is sketched after this list):
EVAL/EVALSHA
KEYS
LRANGE
ZRANGE/ZREVRANGE
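As a concrete example of the KEYS point above, a common mitigation is to iterate with SCAN instead, which spreads the work over many small calls instead of one blocking pass over the keyspace. A minimal sketch with the redis-py client; the endpoint, key pattern and process() helper are hypothetical:

```python
# Hedged sketch: replace a blocking KEYS call with incremental SCAN.
# The endpoint, key pattern and process() helper are hypothetical.
import redis

r = redis.Redis(host="10.0.0.3", port=6379)  # hypothetical instance address

def process(key: bytes) -> None:
    print(key)  # stand-in for whatever work you do per key

# r.keys("session:*") would walk the whole keyspace in one blocking call
# on the single Redis thread; scan_iter issues many small SCAN commands
# instead, spreading the CPU cost over time.
for key in r.scan_iter(match="session:*", count=500):
    process(key)
```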
4. What are the options to tackle the performance issues (if any) caused by higher CPU utilisation (>=100%)?
This question revolves mainly around the scaling design of your implementation. Since Redis is single-threaded, a better approach to reduce the CPU % is to shard your data across multiple Redis instances and put a proxy in front of them to distribute the load. Please take a look at the graph under the Twemproxy section of this link.
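For illustration only, here is a toy sketch of client-side sharding by key hash, assuming two hypothetical instance endpoints; in practice a proxy such as Twemproxy or Redis Cluster handles this routing for you:

```python
# Toy illustration of client-side sharding across two Redis instances.
# Both endpoints are placeholders; real deployments normally put a proxy
# (e.g. Twemproxy) or Redis Cluster in front instead of hand-rolled routing.
import zlib

import redis

shards = [
    redis.Redis(host="10.0.0.3", port=6379),  # hypothetical shard 0
    redis.Redis(host="10.0.0.4", port=6379),  # hypothetical shard 1
]

def shard_for(key: str) -> redis.Redis:
    # A stable hash of the key decides which instance owns it.
    return shards[zlib.crc32(key.encode()) % len(shards)]

shard_for("user:42").set("user:42", "cached-profile")
value = shard_for("user:42").get("user:42")
print(value)
```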
5. Will switching to M5 capacity tier address the high CPU consumption and the corresponding issues?
Switching to a higher capacity tier should help with the latency temporarily, but this is vertical scaling, which is limited to the tiers that Cloud Memorystore offers.

Redis Enterprise solves all the issues you are facing. Redis Enterprise can be deployed in a clustered configuration, utilize all the resources of the machine, and scale out over multiple machines.
The Redis Enterprise software is responsible for watching over CPU utilization and other resource management tasks, so you do not need to.
It is offered on GCP and on the GCP Marketplace as well.
https://redis.com/redis-enterprise-cloud/pricing/


Latency in Cloud Foundry

Questions:
How do you define latency in Cloud Foundry?
Is Cloud Foundry a distributed cloud?
Will high load (transferring oversized files via REST calls) on one of the apps in Cloud Foundry impact other apps' performance? If yes, how?
How is latency calculated for overall cloud network traffic? And are there any metrics that can be used to determine how the current network latency is doing?
Thanks in advance!
How do you define latency in Cloud Foundry?
The same way it's defined elsewhere. With regard to application traffic on CF, the system adds latency because traffic to your application typically routes through two load balancer layers (an external load balancer and the Gorouter), or more (optionally, additional external load balancers).
Each layer takes some amount of time to process the request, which means each layer adds some amount of latency to the request.
Is Cloud Foundry a distributed cloud?
It's a distributed system. Individual components of CF can be scaled as needed (e.g. the Gorouter, UAA and the Cloud Controller are all separate). I'm not sure what's meant beyond that.
Will high load (transferring oversized files via REST calls) on one of the apps in Cloud Foundry impact other apps' performance? If yes, how?
High CPU load in one application can impact the performance of other applications in some ways; however, Cloud Foundry has mitigations in place that generally minimize the impact.
Specifically, an application running on CF is given a certain number of CPU shares, and those shares guarantee a minimal amount of CPU time for that app. If there is contention for CPU, the OS (i.e. the Linux kernel) enforces these limits. If there is no contention, applications may burst above their allocations and consume extra time.
Where you typically see a performance impact caused by load from other applications is when you have an application that is used to consuming (or was perhaps load tested while consuming) more CPU than its allocation, i.e. it expects to be able to burst above its assigned limits. This can be a problem because, while you'll often be able to burst above the CPU limit, if you suddenly get CPU contention from some other app that now requires its fair share of CPU time, the limits will be enforced and the original app won't be able to burst above its limits any more. This is an example of how high load in one app can impact the performance of another application on the platform, though it is not the platform's fault. The application owner should size CPU for the worst case, not the best case.
You can use the cpu entitlement cf cli plugin to get more details on your app's CPU consumption and whether your app is bursting above its entitlement. If it is, you need to increase the memory limit for your app, because CPU shares are directly tied to the memory limit of your app in CF (i.e. there is no way to increase just the CPU shares).
How is latency calculated for overall cloud network traffic? And are there any metrics that can be used to determine how the current network latency is doing?
Again, the same way it's calculated elsewhere. It's the time delay added for the system to process the request.

Google Cloud Kubernetes nodes bandwidth

I'm deploying my cluster on the Google Cloud Kubernetes service. It already has a few nodes. I also need a server with a GPU from Google Cloud to work with my cluster. The GPU instance continuously processes the incoming traffic (bandwidth should be up to 1 Gb/s) and sends the results to cluster nodes (bandwidth should be even higher than the incoming bandwidth).
The most critical things for me in the project:
1) bandwidth between these nodes inside cluster;
2) bandwidth between the node and the GPU server;
3) bandwidth between the GPU server and the world;
4) bandwidth between the node and world.
The minimum appropriate bandwidth for each node is 1 Gb/s for both download and upload. When I run speed tests, they show a download speed of 100-680 Mb/s and an upload speed of 67-138 Mb/s for the same node at the same time (the screenshots below were taken 30 seconds apart). So the current bandwidth is too small and unstable, but I need a stable bandwidth of at least 1 Gb/s.
I tried to find a technical specification or pricing for bandwidth in Google's documentation, but the technical specification only lists CPU/GPU/RAM/Disk, not bandwidth, and the docs only give per-month traffic pricing.
TL;DR:
How can I get a stable bandwidth of 1 Gb/s or more for each of the cluster nodes, the GPU instance and any other Google Cloud virtual machine?
Is there any service in Google Cloud that provides bandwidth of more than 1 Gb/s?
Is there any solution/service in Google Cloud for handling heavy Internet traffic?
P.S. speed tests were made via:
npx speedo-cli
There's no guarantee really, especially when it comes to traffic to/from networks outside GCP. Here are a few things you can do to maximize bandwidth, though:
Increase the number of CPU cores per instance:
caps are dependent on the number of vCPUs that a virtual machine instance has. Each core is subject to a 2 Gbits/second (Gbps) cap for peak performance. Each additional core increases the network cap, up to a theoretical maximum of 16 Gbps for each virtual machine. source
Note that the 2 Gbps per vCPU cap represents a theoretical limit using internal networks:
The cap is a limit that can't be exceeded and doesn't indicate the actual throughput of your egress traffic. There is no guarantee that your traffic will achieve the maximum throughput, which depends on many factors other than the cap. source
In case of traffic between VMs (i.e., cases 1 and 2 in your question) make sure the VMs are located in the same zone and you're using internal IPs:
Any time you transfer data or communicate between VMs, you can achieve max performance by always using the internal IP to communicate. In many cases, the difference in speed can be drastic. source
Check this answer on how to measure network bandwidth between VMs using iperf.
For advanced use cases you can try fine-tuning the TCP window size in your VMs.
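As an illustration of what "window size" means at the application level, the sketch below sets larger socket buffers, which bound the TCP window a connection can advertise; the 4 MiB figure is an arbitrary example, and system-wide tuning is normally done through sysctls such as net.ipv4.tcp_rmem/tcp_wmem rather than per-socket calls.

```python
# Illustration of the "window size" knob at the application level: the
# socket buffer size bounds the TCP window a connection can advertise, so
# undersized buffers cap throughput on high-latency paths. The 4 MiB value
# is an arbitrary example; the kernel clamps it to net.core.rmem_max /
# wmem_max, and system-wide tuning is usually done via sysctls instead.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4 * 1024 * 1024)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4 * 1024 * 1024)

print("receive buffer:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
print("send buffer:   ", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
```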
Finally, one benchmark observed that the GCP network throughput is 81x more variable when compared to AWS. Naturally this just reflects one benchmark but you might find it worthwhile to test other providers yourself.
Since Aleksi's answer there have been some changes to the per-VM egress cap/throttle. It is still computed as 2 Gbit/s * NumberOfvCPUs, but the maximum is now 32 Gbit/s (when the VM is created with min_cpu_platform of skylake or better) and there is a minimum of 10 Gbit/s for VMs with 2 or more vCPUs.
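Putting those rules into a small helper (the 16 Gbit/s ceiling for non-Skylake VMs comes from the earlier answer; all of these figures may have changed since, so check the current documentation):

```python
def egress_cap_gbps(vcpus: int, skylake_or_better: bool = False) -> float:
    """Rough per-VM egress cap per the rules above; the 16 Gbit/s ceiling
    for non-Skylake VMs comes from the earlier answer, and all figures
    may have changed since, so check the current GCP documentation."""
    cap = 2.0 * vcpus                        # 2 Gbit/s per vCPU
    if vcpus >= 2:
        cap = max(cap, 10.0)                 # floor for VMs with 2+ vCPUs
    ceiling = 32.0 if skylake_or_better else 16.0
    return min(cap, ceiling)

print(egress_cap_gbps(1))                           # 2.0
print(egress_cap_gbps(2))                           # 10.0
print(egress_cap_gbps(32, skylake_or_better=True))  # 32.0
```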
It wasn't clear to me what the endpoints were for your speed test, but one of the (many) limits to the throughput of a TCP connection is:
Throughput <= WindowSize / RoundTripTime
One would expect the GPU instance and the node(s) to be located close to one another, but that limit may come into play for the GPU instance and the nodes talking to the world.
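A quick worked example of that bound, with purely illustrative numbers for the window and round-trip time:

```python
# Worked example of the Throughput <= WindowSize / RoundTripTime bound,
# with purely illustrative numbers.
window_bytes = 64 * 1024    # assumed TCP window (64 KiB)
rtt_seconds = 0.100         # assumed round trip to "the world" (100 ms)

max_throughput_bps = window_bytes * 8 / rtt_seconds
print(f"{max_throughput_bps / 1e6:.1f} Mbit/s")    # ~5.2 Mbit/s
```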
Beyond that, understanding what was happening with the variable throughput likely calls for packet traces, definitely at the sending side and preferably at the receiving side as well. Just the first 96 bytes of each packet would be sufficient in this sort of case. It is one of the things a support organization would request.
I fear that you can't get any bandwidth commitments on shared (multi-tenant) infrastructure. If you have (a lot of) cash, using sole-tenant nodes[1] with all the parts of your architecture on the same tenant can help to avoid interference from noisy neighbors.
But even in this case, there is no commitment on network bandwidth. And, for now, GPUs aren't supported in this solution.
1: https://cloud.google.com/compute/docs/nodes/

What is the number of cores in aws.data.highio.i3 elastic cloud instance given for a 14 day trial period?

I want to make some performance calculations, hence I need to know the number of cores that this aws.data.highio.i3 instance deployed by Elastic Cloud on AWS has. I know that it has 4 GB of RAM, so if anyone can help me with the number of cores that would be really helpful.
I am working with Elasticsearch deployed on Elastic Cloud, and my use case requires me to make approximately 40 million writes a day, so could you suggest which machines I should use that fit my use case and are I/O optimized as well?
The instance used by Elastic Cloud for aws.data.highio.i3 in the background is i3.8xlarge, see here. That means it has 32 virtual CPUs or 16 cores, see here.
But you don't own the instance in Elastic Cloud; from the reference hardware page:
Host machines are shared between deployments, but containerization and guaranteed resource assignment for each deployment prevent a noisy neighbor effect.
Each ES process runs on a large multi-tenant server with resources carved out using cgroups, and ES scales the thread pool sizing automatically. You can see the number of times the CPU was throttled by the cgroups if you go to Stack Monitoring -> Advanced and scroll down to the graphs Cgroup CPU Performance and Cgroup CFS Stats.
That being said, if you need full CPU availability all the time, better go with AWS Elasticsearch service or host your own cluster.
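For completeness, on a host you control (not an Elastic Cloud deployment, where you have no host access), the cgroup throttling counters behind those graphs can be read directly. This sketch assumes a cgroup v1 layout; under cgroup v2 the file and field names differ slightly:

```python
# Illustrative only: on a host you control, the cgroup throttling counters
# behind those graphs can be read directly. The path assumes a cgroup v1
# layout; under cgroup v2 the file lives at <cgroup>/cpu.stat with slightly
# different field names (nr_throttled / throttled_usec).
CPU_STAT = "/sys/fs/cgroup/cpu/cpu.stat"   # hypothetical cgroup path

stats = {}
with open(CPU_STAT) as f:
    for line in f:
        name, value = line.split()
        stats[name] = int(value)

print("scheduling periods :", stats.get("nr_periods", 0))
print("periods throttled  :", stats.get("nr_throttled", 0))
print("time throttled (ns):", stats.get("throttled_time", 0))
```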

Capacity planning on AWS

I need some understanding of how to do capacity planning for AWS and what kind of infrastructure components to use. I am taking the example below.
I need to set up a Node.js-based server which uses Kafka, Redis and MongoDB. There will be 250 devices connecting to the server and sending in data every 10 seconds. The size of each data packet will be approximately 10 KB. I will be using the 64-bit Ubuntu image.
What I need to estimate,
MongoDB requires at least 3 servers for redundancy. How do I estimate the size of the VM and the EBS volume required, e.g. should it be m4.large, m4.xlarge or something else? The default EBS volume size is 30 GB.
What should be the size of the VM for running the other application components, which include 3-4 Node.js processes, Kafka and Redis? E.g. should it be m4.large, m4.xlarge or something else?
Can I keep just one application server in an auto scaling group and add more as the load increases, or should I go with a minimum of 2?
I want to understand in general, given the number of devices, the data packet size and the data frequency, how we go about estimating which VM to consider, how much storage to consider, and perhaps any other considerations too.
Nobody can answer this question for you. It all depends on your application and usage patterns.
The only way to correctly answer this question is to deploy some infrastructure and simulate standard usage while measuring the performance of the systems (throughput, latency, disk access, memory, CPU load, etc).
Then, modify the infrastructure (add/remove instances, change instance types, etc) and measure again.
You should certainly run a minimal deployment per your requirements (e.g. instances in separate Availability Zones for High Availability), and you can use Auto Scaling to add extra capacity when required, but simulated testing would also be required to determine the right trigger points at which more capacity should be added. For example, the best indicator might be memory, or CPU, or latency. It all depends on the application and how it behaves under load.
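As a starting point for sizing that test deployment, here is a rough back-of-envelope calculation from the numbers in the question; the retention period and replication factor below are assumptions for illustration, not requirements from the question:

```python
# Back-of-envelope ingest and storage estimate from the numbers in the
# question (250 devices, ~10 KB every 10 seconds). The retention period
# and 3x replication factor are assumptions for illustration only.
devices = 250
packet_kb = 10
interval_s = 10

ingest_kb_per_s = devices * packet_kb / interval_s            # 250 KB/s
ingest_gb_per_day = ingest_kb_per_s * 86_400 / (1024 * 1024)  # ~20.6 GB/day

retention_days = 90       # assumed
replication_factor = 3    # MongoDB replica set of 3

total_storage_gb = ingest_gb_per_day * retention_days * replication_factor

print(f"ingest: {ingest_kb_per_s:.0f} KB/s, {ingest_gb_per_day:.1f} GB/day")
print(f"storage for {retention_days} days x{replication_factor} copies: "
      f"{total_storage_gb:.0f} GB")
```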

Amazon AWS micro instance with 100% cpu and unresponsive

I've been having a problem with my AWS EC2 Ubuntu instances: they always reach 100% CPU utilization after a certain amount of time (around 8 hours) until I restart them.
The instance is Ubuntu Server 13.04 and it has a basic LAMP stack, that's all.
I have a cron job that does a ping every couple of minutes to keep a VPN tunnel up, but that shouldn't be causing this.
When it's at 100% CPU utilization I can't ping it, SSH into it or browse it, but it doesn't reject the connection, it just keeps "trying".
Any idea what the reason behind it is? I'm guessing it has something to do with Amazon throttling the instance, but it's weird that it sits at 100% CPU use over 8 hours.
This is the CPU log of the instance; every other indicator seems normal.
I can't attach images here, so I'm posting a link:
100% cpu utilization
EDIT
This happened to me before with other instances, and right now I have an Amazon Linux AMI that has been running at 100% for 4 days straight, and that one only has Tomcat, with no apps deployed. I just realized it's unresponsive; I'm terminating it.
Author's note, 2019: this post was originally written in 2013, and is about the t1.micro instance type. The current EC2 free tier now allows you to choose either the t1.micro or t2.micro instance class. Unlike the t1.micro's intermittent hard-clamping behavior, the t2.micro runs continuously at full capacity until your CPU credit balance nears depletion, and degrades much more gracefully.
This is the expected behavior. See t1.micro Instances in the EC2 User Guide for Linux Instances.
Note the graphs that say "CPU level limited." I have measured this, and if you consume 100% cpu on a micro instance for more than about 15 seconds, the throttling kicks in and your available cycles drop from 2 ECU to approximately 0.2 ECU (roughly 200MHz) for the next 2-3 minutes, at which point the cycle repeats and you'll be throttled again in just a few seconds if you are still pulling hard on the processor.
During throttle time, you only get ~1/10th of the cycles compared to when you are getting peak performance, because the hypervisor "steals" the rest¹... so you are still going to see that you were using a solid 100%... because you were using all that were available. It doesn't take much to pin a micro to the ceiling. Or floor... so either you are asking too much of the instance class or you have something unexpectedly maxing your CPU.
Establish an SSH connection while the machine is responsive, start "top" running, and then stay connected, so that when it starts to slow down, you already have the tool going that you need to use to find out what's the cpu hog.
¹ hypervisor steals the rest: A common misconception at one time was that the time stolen from EC2 instances by the hypervisor (visible in top and similar utilities) was caused by "noisy neighbors" -- other instances on the same hardware competing for CPU cycles. This is not the cause of stolen cycles. For some older instance families, like the m1, stolen cycles would be seen if AWS had provisioned your instance on a host machine that had faster processors than those specified for the instance class; the cycles were stolen so that the instance had performance matching what you were paying for, rather than the performance of the actual underlying hardware. EC2 instances don't share the physical resources underlying your virtualized CPU resources.
Run top and see how high st (or steal) is. If st is at 97%, then you are being throttled and only have 3% of your CPU to work with. You don't need to be doing anything CPU intensive for that to be slow!
If that is the case and you cannot change how much CPU you require, the only fix is to upgrade to a small instance. Small instances are not throttled as heavily.
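If you prefer a quick scripted check over watching top, the sketch below samples the aggregate CPU counters from /proc/stat twice and reports the steal share; field positions follow proc(5):

```python
# Sample the aggregate "cpu" line from /proc/stat twice and report the
# fraction of CPU time stolen by the hypervisor in between. The steal
# counter is the 8th value after "cpu" (see proc(5)).
import time

def cpu_fields():
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

before = cpu_fields()
time.sleep(5)
after = cpu_fields()

deltas = [b - a for a, b in zip(before, after)]
total = sum(deltas) or 1
steal = deltas[7] if len(deltas) > 7 else 0
print(f"steal: {100.0 * steal / total:.1f}% of CPU time over the last 5 s")
```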
http://theon.github.io/you-may-want-to-drop-that-ec2-micro-instance.html