Google Cloud Kubernetes nodes bandwidth - google-cloud-platform

I'm deploying my cluster on the Google Cloud Kubernetes service. It already has a few nodes. I also need a GPU server from Google Cloud to work with my cluster. The GPU instance continuously processes incoming traffic (bandwidth should be up to 1 Gb/s) and sends the results to the cluster nodes (this bandwidth should be even higher than the incoming bandwidth).
The most critical things for me in the project:
1) bandwidth between these nodes inside cluster;
2) bandwidth between the node and the GPU server;
3) bandwidth between the GPU server and the world;
4) bandwidth between the node and world.
The minimum acceptable bandwidth for each node is 1 Gb/s for both download and upload. When I run speed tests, they show a download speed of 100-680 Mb/s and an upload speed of 67-138 Mb/s for the same node at the same time (the screenshots below were taken 30 seconds apart). So the current bandwidth is too low and unstable, but I need stable bandwidth of at least 1 Gb/s.
I tried to find a technical specification or pricing for bandwidth in the Google docs, but the technical specifications only cover CPU/GPU/RAM/disk, not bandwidth, and the docs only list pricing for traffic per month.
TL;DR:
How can I set stable 1 Gb/s or more bandwidth for each of the cluster nodes, GPU instance and any other Google Cloud virtual machine?
Is there any service in Google Cloud that provides bandwidth of more than 1 Gb/s?
Is there any solution/service in Google Cloud for handling heavy Internet traffic?
P.S. speed tests were made via:
npx speedo-cli

There's no guarantee really, especially when it comes to traffic to/from networks outside GCP. Here's a few things you can do to maximize bandwidth though:
Increase the number of CPU cores per instance:
caps are dependent on the number of vCPUs that a virtual machine instance has. Each core is subject to a 2 Gbits/second (Gbps) cap for peak performance. Each additional core increases the network cap, up to a theoretical maximum of 16 Gbps for each virtual machine. source
Note that the 2 Gbps per vCPU cap represents a theoretical limit using internal networks:
The cap is a limit that can't be exceeded and doesn't indicate the actual throughput of your egress traffic. There is no guarantee that your traffic will achieve the maximum throughput, which depends on many factors other than the cap. source
In case of traffic between VMs (i.e., cases 1 and 2 in your question) make sure the VMs are located in the same zone and you're using internal IPs:
Any time you transfer data or communicate between VMs, you can achieve max performance by always using the internal IP to communicate. In many cases, the difference in speed can be drastic. source
Check this answer on how to measure network bandwidth between VMs using iperf.
For advanced use cases you can try fine-tuning the TCP window size in your VMs.
Finally, one benchmark observed that the GCP network throughput is 81x more variable when compared to AWS. Naturally this just reflects one benchmark but you might find it worthwhile to test other providers yourself.

Since Aleksi's answer there have been some changes to the per-VM egress cap/throttle. It is still computed as 2 Gbit/s * NumberOfvCPUs, but the maximum is now 32 Gbit/s (when the VM is created with min_cpu_platform of skylake or better) and there is a minimum of 10 Gbit/s for VMs with 2 or more vCPUs.
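As a rough illustration, the cap rule described above can be written as a small function. This is only a sketch of the rule as stated in this answer, not an official formula: the function name is made up, and the 16 Gbit/s ceiling for VMs created without Skylake or better is an assumption carried over from the older limit quoted earlier in this thread.

```python
def egress_cap_gbps(vcpus: int, skylake_or_better: bool = True) -> float:
    """Per-VM egress cap in Gbit/s, following the rule described in this answer."""
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    cap = 2.0 * vcpus                 # 2 Gbit/s per vCPU
    if vcpus >= 2:
        cap = max(cap, 10.0)          # floor of 10 Gbit/s for 2 or more vCPUs
    # Assumption: the older 16 Gbit/s ceiling still applies without Skylake or better.
    ceiling = 32.0 if skylake_or_better else 16.0
    return min(cap, ceiling)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32):
        print(f"{n:>2} vCPUs -> cap of {egress_cap_gbps(n):.0f} Gbit/s")
```

Keep in mind that this is a ceiling, not a throughput you can expect to reach.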
It wasn't clear to me what the endpoints were for your speed test, but one of the (many) limits to the throughput of a TCP connection is:
Throughput <= WindowSize / RoundTripTime
One would expect the GPU instance and the node(s) would be located close to one another, but that limit may come into play for GPU instance and node to the world.
Beyond that, understanding what was happening for the variable throughput likely calls for packet traces, definitely at the sending side, preferably at the receiving side as well. Just the first 96 bytes of each packet would be sufficient in this sort of case. It would be one of the things a support organization would request.
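To make the window/RTT limit more concrete, here is a small sketch that turns the inequality around and computes the window a single TCP connection would need to sustain 1 Gb/s at a few round-trip times. The RTT values are illustrative assumptions, not measurements from your setup.

```python
def required_window_bytes(target_bits_per_s: float, rtt_s: float) -> float:
    """Smallest window (in bytes) for which WindowSize / RTT reaches the target rate."""
    return target_bits_per_s * rtt_s / 8


if __name__ == "__main__":
    target = 1e9  # 1 Gb/s, the rate asked for in the question
    for rtt_ms in (1, 10, 60, 150):  # hypothetical round-trip times
        window = required_window_bytes(target, rtt_ms / 1000)
        print(f"RTT {rtt_ms:>3} ms -> window of at least {window / 1024:,.0f} KiB")
```

The longer the path, the larger the window both ends have to support, which is why traffic to "the world" is more likely to hit this limit than traffic between nearby VMs.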

I fear that you can't get any bandwidth commitment on shared infrastructure. If you have (a lot of) cash, using sole-tenant nodes[1] with all the parts of your architecture on the same tenant can help eliminate interference from other tenants.
But even in this case, there is no commitment on network bandwidth. And, for now, GPUs aren't supported in this solution.
1: https://cloud.google.com/compute/docs/nodes/

Related

Cloud Memorystore Redis high CPU utilisation

We are using a Cloud Memorystore Redis instance to add a caching layer to our mission-critical, Internet-facing application. The total number of calls (including get, set and key expiry operations) to the Memorystore instance is around 10-15K per second. CPU utilisation has been consistently around 75-80%, and we expect it to go even higher.
Currently, we are using M4 capacity tier under Standard service tier.
https://cloud.google.com/memorystore/docs/redis/pricing
Need some clarity around the following pointers.
How many CPU cores does the M4 capacity tier correspond to?
Is it really alarming to have more than 100% CPU utilisation? Do we expect any noticeable performance issues?
What are the options to tackle the performance issues (if any) caused by higher CPU utilisation (>=100%)? Will switching to the M5 capacity tier address the high CPU consumption and the corresponding issues?
Our application is really CPU intensive and we don't see any way to further optimize our application. Looking forward to some helpful references.
Addressing your questions.
1. How many CPU cores does the M4 capacity tier correspond to?
Cloud Memorystore for Redis is a Google-managed service, which means that Google does not disclose the internal details (resources) of the virtual machine that runs the Redis service. Still, it is expected that the higher the capacity tier, the more resources (CPU) the virtual machine will have. In your case in particular, adding CPUs will not solve the CPU usage issue because the Redis service itself is single-threaded.
As you can see from the previous link:
To maximize CPU usage you can start multiple instances of Redis.
If you want to use multiple CPUs you can start thinking of some way to shard earlier.
2. Is it really alarming to have more than 100% CPU utilisation?
Yes, it is alarming to have high CPU utilization because it can result in connection errors or high latency.
CPU utilization is important, but so is whether the Redis instance is efficient enough to sustain your throughput at a given latency. You can check the Redis latency with the command redis-cli --latency while CPU % is high.
3. Do we expect any noticeable performance issues?
This is really hard to say or predict because it depends on several factors (client service, commands run within a time frame, workload). Some of the most common causes of high latency and performance issues are:
Client VMs or services are overloaded and not consuming the messages from Redis: When a client opens a TCP connection to Redis, the Redis server keeps a buffer of messages to send to that connection. If a client service has its CPU maxed out, leaving the kernel no time to receive messages from Redis, those messages pile up on the Redis server.
The commands executed are consuming a lot of CPU: The following commands are known to be potentially very expensive to process:
EVAL/EVALSHA
KEYS
LRANGE
ZRANGE/ZREVRANGE
4. What are the options to tackle the performance issues (if any) caused by higher CPU utilisation (>=100%)?
This question revolves mainly around the scaling design of your implementation. Since Redis is single-threaded, a better approach to reducing CPU % would be to shard your data across multiple Redis instances and put a proxy in front of them to distribute the load. Please take a look at the graph under section Twemproxy from this link.
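To illustrate the idea of sharding, here is a minimal sketch of hash-based key routing done on the client side. It is not how Twemproxy itself is implemented; it only shows the principle of mapping each key to one of several instances. The shard addresses and the function name are hypothetical.

```python
import zlib

# Hypothetical addresses of three independent Redis instances.
SHARDS = ["redis-shard-0:6379", "redis-shard-1:6379", "redis-shard-2:6379"]


def shard_for_key(key: str, shards=SHARDS) -> str:
    """Pick a shard deterministically from a CRC32 hash of the key."""
    index = zlib.crc32(key.encode("utf-8")) % len(shards)
    return shards[index]


if __name__ == "__main__":
    for key in ("user:1", "user:2", "session:abc"):
        print(f"{key} -> {shard_for_key(key)}")
```

Note that plain modulo hashing remaps most keys whenever the number of shards changes, so for a cache that has to grow over time a consistent-hashing scheme is usually a better fit.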
5. Will switching to M5 capacity tier address the high CPU consumption and the corresponding issues?
Switching to a higher capacity tier should help with the latency temporarily, but this is vertical scaling, which is limited to the tiers that Cloud Memorystore offers.
Redis Enterprise solves all the issues you are facing. Redis Enterprise can be configured in a clustered configuration and utilize all the resources of the machine as well as scale out over multiple machines.
The Redis Enterprise Software is responsible for watching over the CPU utilization and other resource management tasks so you do not need to.
It is offered on GCP and GCP marketplace as well.
https://redis.com/redis-enterprise-cloud/pricing/

What is the number of cores in aws.data.highio.i3 elastic cloud instance given for a 14 day trial period?

I want to make some performance calculations, so I need to know the number of cores in the aws.data.highio.i3 instance deployed by Elastic Cloud on AWS. I know that it has 4 GB of RAM, so if anyone can tell me the number of cores, that would be really helpful.
I am working with Elasticsearch deployed on Elastic Cloud, and my use case requires approximately 40 million writes per day, so please also suggest which machines would suit this use case and are I/O optimized as well.
The instance used by Elastic Cloud for aws.data.highio.i3 in the background is i3.8xlarge, see here. That means it has 32 virtual CPUs or 16 cores, see here.
But you don't own the instance in Elastic Cloud. From the reference hardware page:
Host machines are shared between deployments, but containerization and
guaranteed resource assignment for each deployment prevent a noisy
neighbor effect.
Each ES process runs on a large multi-tenant server with resources carved out using cgroups, and ES scales the thread pool sizing automatically. You can see the number of times that the CPU was throttled by the cgroups if you go to Stack Monitoring -> Advanced and down to graphs Cgroup CPU Performance and Cgroup CFS Stats.
That being said, if you need full CPU availability all the time, better go with AWS Elasticsearch service or host your own cluster.

Is there any way to increase the upload speed of Google Compute Engine?

I'm setting up a Plex server on my Google Cloud Platform instance, but media files are stalling because the Google Cloud upload rate does not exceed 10 Mbps. I live in Timon, Maranhão, and the nearest Google server is in São Paulo, Brazil. The download rate reaches 1 Gbps and the latency is 60 ms, but the upload is only 10 Mbps. I'm using the Speedtest.net site for the speed test. Could someone help me improve the upload speed?
I think the limitation is not on Google's side, since Google Cloud Platform does not impose bandwidth caps on ingress traffic. The amount of ingress traffic a GCE instance can handle depends on the machine type and operating system.
On the other hand, the outbound or egress traffic from a virtual machine is subject to maximum network egress throughput caps. These caps are dependent on the number of vCPUs that a virtual machine instance has. Each core is subject to a 2 Gbits/second (Gbps) cap for peak performance. Each additional core increases the network cap, up to a theoretical maximum of 16 Gbps for each virtual machine.
You can see this yourself by setting up a number of instance types and logging their iperf performance. As @John Hanley states, Google does not give any guarantees for performance over the Internet, since upload speeds can vary based on a number of conditions, including the ISP used on the on-premises side.
One of the (many) limits to the performance of a TCP connection is:
Throughput <= WindowSize / RoundTripTime
So it is possible that the window size your local system and the instance in GCP will provide in the upload direction needs to be increased to accommodate the round-trip-time between the two. What do you see for the RoundTripTime? For example, if you "ping" your instance from your local system what does it say about the round-trip-time?
Also, it is not enough for the receiver to advertise that much window, the sender must be willing/able to send that much. So, both sides may need to be tweaked.
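As a rough worked example of that limit, the sketch below computes the throughput ceiling for a given window and round-trip time. Two assumptions are baked in: that the 60 ms figure in the question is the round-trip time, and that the effective window is 64 KiB, roughly the largest window TCP can advertise without window scaling. The real window on your connection may well be different, so treat the numbers as illustrative only.

```python
def max_throughput_mbps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on single-connection TCP throughput: WindowSize / RoundTripTime."""
    bits_per_second = window_bytes * 8 / (rtt_ms / 1000)
    return bits_per_second / 1e6


if __name__ == "__main__":
    rtt_ms = 60  # assumed round-trip time, taken from the question
    for window_kib in (64, 256, 1024, 8192):  # hypothetical window sizes
        limit = max_throughput_mbps(window_kib * 1024, rtt_ms)
        print(f"window {window_kib:>5} KiB, RTT {rtt_ms} ms -> at most {limit:,.1f} Mbit/s")
```

Under those assumptions a 64 KiB window caps a single connection at a bit under 9 Mbit/s, which is in the same range as the upload rate reported in the question, so the window size is worth checking on both ends.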
Further, there has been a change in the computation of the per-VM network egress cap since Raul's answer. It is still computed as 2 Gbit/s multiplied by the number of vCPUs in the instance, but the upper bound (if you specify Skylake or better for the CPU family) is now 32 Gbit/s, and there is now a lower bound of 10 Gbit/s for instances with 2 or more vCPUs. That cap is applied to the VM as a whole. As before, those are "guaranteed not to exceed", not "guaranteed to achieve."
For future readers:
Alternatively, you can upload the files to Google Cloud Storage, then SSH to the server and download them:
gcloud storage cp gs://BUCKET_NAME/OBJECT_NAME SAVE_TO_LOCATION
Optionally, you can zip the files before uploading them to Google Cloud Storage.
gcloud storage CLI:
https://cloud.google.com/storage/docs/downloading-objects#cli-download-object

Capacity planning on AWS

I need some understanding on how to do capacity planning for AWS and what kind of infrastructure components to use. I am taking the below example.
I need to set up a Node.js-based server which uses Kafka, Redis and MongoDB. There will be 250 devices connecting to the server and sending in data every 10 seconds. The size of each data packet will be approximately 10kb. I will be using the 64-bit Ubuntu image.
What I need to estimate,
MongoDB requires at least 3 servers for redundancy. How do I estimate the size of the VM and EBS volume required, e.g. should it be m4.large, m4.xlarge or something else? The default EBS volume size is 30GB.
What should be the size of the VM for running the other application components which include 3-4 processes of nodejs, kafka and redis? e.g. should be m4.large, m4.xlarge or something else?
Can I keep just one application server in an autoscaling group and add more as the load increases, or should I go with a minimum of 2?
I want to understand in general how, given the number of devices, data packet size and data frequency, we go about estimating which VM and how much storage to consider, and perhaps any other considerations too.
Nobody can answer this question for you. It all depends on your application and usage patterns.
The only way to correctly answer this question is to deploy some infrastructure and simulate standard usage while measuring the performance of the systems (throughput, latency, disk access, memory, CPU load, etc).
Then, modify the infrastructure (add/remove instances, change instance types, etc) and measure again.
You should certainly run a minimal deployment per your requirements (eg instances in separate Availability Zones for High Availability) and you can use Auto Scaling to add extra capacity when required, but simulated testing would also be required to determine the right triggers points where more capacity should be added. For example, the best indicator might be memory, or CPU, or latency. It all depends on the application and how it behaves under load.
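What you can compute up front is the raw ingest rate, which gives the load simulation a starting point. The sketch below uses the figures from the question and reads "10kb" as 10,000 bytes, which is an assumption. It covers payload only and leaves out everything that actually drives instance and volume sizing (indexes, replication across the three MongoDB members, Kafka retention, logs).

```python
SECONDS_PER_DAY = 86_400


def ingest_estimate(devices: int, packet_bytes: int, interval_s: int):
    """Return (messages per second, bytes per second, bytes per day) of raw payload."""
    msgs_per_s = devices / interval_s
    bytes_per_s = msgs_per_s * packet_bytes
    bytes_per_day = bytes_per_s * SECONDS_PER_DAY
    return msgs_per_s, bytes_per_s, bytes_per_day


if __name__ == "__main__":
    # 250 devices, one packet of about 10,000 bytes (assumed) every 10 seconds.
    msgs, bps, bpd = ingest_estimate(devices=250, packet_bytes=10_000, interval_s=10)
    print(f"{msgs:.0f} messages/s, {bps / 1e3:.0f} kB/s, {bpd / 1e9:.1f} GB/day of raw payload")
```

Numbers like these only tell you what to feed into the test; the measurements under simulated load are still what should decide the instance type and storage size.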

Ensuring consistent network throughput from AWS EC2 instance?

I have created a few AWS EC2 instances; however, sometimes my data throughput (both upload and download) becomes highly limited on certain servers.
For example, I typically get about 15-17 MB/s throughput from an instance located in the US West (Oregon) region. However, sometimes, especially when I transfer a large amount of data in a single day, my throughput drops to 1-2 MB/s. When this happens on one server, the other servers still have their typical network throughput (as expected).
How can I avoid it? And what can cause this?
If it is due to amount of my data upload/download, how can I avoid it?
At the moment, I am using t2.micro type instances.
Simple answer: don't use micro instances.
AWS is a multi-tenant environment, and as such resources are shared. When it comes to network performance, the larger instance sizes get higher priority. Only the largest instances get any sort of dedicated performance.
Micro and nano instances get the lowest priority of all instance types.
This matrix will show you what priority each instance size gets:
https://aws.amazon.com/ec2/instance-types/#instance-type-matrix