We recently moved our databases to AWS RDS while our applications remain on-prem. As expected, latency was huge, so we provisioned Direct Connect through Megaport between the AWS Oregon region (RDS) and our data centers (applications) in San Francisco.
Surprisingly, we did not see any major difference in latency (please find the attached results and the data below). It is almost the same as the connection over the Internet.
OnPrem App - OnPrem DB (Seconds) Insert: 0.112
OnPrem App - AWS DB Over Direct Connect(Seconds) Insert: 1.332
OnPrem App - AWS DB Over Internet (Seconds) Insert: 1.50
Is this expected?
Do we have any options to improve latency?
Please provide any checkpoints for improvements.
Appreciate your support.
Latency on uncongested routes is largely a function of distance, along with the number of hops in the link.
Assuming your DC has uncongested connections to the Internet (and AWS certainly does), congestion is not going to be an issue for small requests, and latency will be relatively low.
However, that's not guaranteed and could vary from reasonable to terrible at any time. The Internet tends to suffer a degree of packet loss, which causes retransmissions that increase latency. These effects would be more noticeable with a larger traffic volume.
What Direct Connect gets you, along with security improvements, is guaranteed latency and bandwidth. Not only is your request marginally quicker, you can ramp up the volume and be sure that the performance won't get any worse.
Megaport publishes latency figures for their part of the network.
Unfortunately there are not any RDS options that would reduce the latency further. Other strategies such as local read-replicas or local caching may be suitable for your application.
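To see why a dedicated link barely moves the numbers in the question, it can help to estimate how many sequential round trips the insert makes. The sketch below is only a rough model: the 20 ms round-trip time is a hypothetical figure (not measured on this link), and the function names are made up for illustration.

```python
def implied_round_trips(remote_seconds: float, local_seconds: float, rtt_seconds: float) -> float:
    """Estimate how many sequential round trips explain the extra elapsed time."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return (remote_seconds - local_seconds) / rtt_seconds


def estimated_elapsed(local_seconds: float, round_trips: float, rtt_seconds: float) -> float:
    """Elapsed time when every round trip pays the full network RTT."""
    return local_seconds + round_trips * rtt_seconds


ASSUMED_RTT = 0.020  # hypothetical 20 ms round trip; replace with a measured value

# Timings taken from the question: 1.332 s over Direct Connect, 0.112 s on-prem.
trips = implied_round_trips(1.332, 0.112, ASSUMED_RTT)
print(f"implied sequential round trips: {trips:.0f}")

# If the same work were batched into a tenth of the round trips (hypothetical):
print(f"batched estimate: {estimated_elapsed(0.112, trips / 10, ASSUMED_RTT):.3f} s")
```

If the real figure is anywhere near this, the cost is dominated by the number of round trips rather than by the quality of the link, so batching statements or moving chatty logic closer to the database is likely to help more than a faster circuit.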
Disclaimer: I work for Megaport
We have an RDS instance in us-east-1; the applications that access it are in us-east-1 and us-east-2. The two regions are connected through VPC peering. We load-balance incoming requests using a Route 53 weighted routing policy. We see a 10 ms delay when communicating across regions. Currently, this 10 ms delay is acceptable between microservices, but when the applications in the second region access RDS, we face a considerable delay (due to the large number of database calls made by Hibernate). Are there any ways to reduce this database latency?
If you have cross-region databases and are reading from read replicas, then expect some delay. This is not a technology issue but a physics problem, as data has to physically travel from A to B. The latency will grow as you move these points (A and B) further away from each other.
One way to reduce this latency is to use a cache layer. The first read of each item will come from the database (cross-region), but subsequent reads can be served from a cache such as Redis.
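A minimal sketch of that caching idea follows. It uses an in-process dictionary as a stand-in for Redis so that it runs on its own; the class name, the `loader` callback, and the TTL value are all assumptions made for illustration, not part of any particular library.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ReadThroughCache:
    """Serve repeated reads locally and call the slow loader only on a miss."""

    def __init__(self, loader: Callable[[str], Any], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader          # hypothetical cross-region database lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[Any, float]] = {}  # key -> (value, expiry)

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]            # cache hit: no cross-region round trip
        value = self._loader(key)      # cache miss: pay the cross-region latency once
        self._store[key] = (value, now + self._ttl)
        return value


def load_user_from_database(user_id: str) -> dict:
    """Placeholder for a real query against the remote RDS instance."""
    return {"id": user_id}


cache = ReadThroughCache(load_user_from_database, ttl_seconds=30.0)
first = cache.get("42")    # goes to the loader
second = cache.get("42")   # served from the local cache
```

The trade-off is staleness: a value can be out of date for up to the TTL, so this suits data that is read far more often than it changes.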
We are using a Cloud Memorystore Redis instance to add a caching layer to our mission-critical, Internet-facing application. The total number of calls (including get, set and key expiry operations) to the Memorystore instance is around 10-15K per second. CPU utilisation has been consistently around 75-80%, and we expect it to go even higher.
Currently, we are using M4 capacity tier under Standard service tier.
https://cloud.google.com/memorystore/docs/redis/pricing
We need some clarity on the following points.
How many CPU cores does the M4 capacity tier correspond to?
Is it really alarming to have more than 100% CPU utilisation? Do we expect any noticeable performance issues?
What are the options to tackle the performance issues (if any) caused by higher CPU utilisation (>=100%)? Will switching to M5 capacity tier address the high CPU consumption and the corresponding issues?
Our application is really CPU intensive and we don't see any way to further optimize our application. Looking forward to some helpful references.
Addressing your questions.
1. How many CPU cores does the M4 capacity tier correspond to?
Cloud Memorystore for Redis is a Google-managed service, which means that Google does not expose the internal details (resources) of the virtual machine that runs the Redis service. Still, it is expected that the higher the capacity tier, the more resources (CPU) the virtual machine will have. In your case in particular, adding CPUs will not solve the CPU usage issues, because the Redis service itself is single-threaded.
As you can see from the previous link:
To maximize CPU usage you can start multiple instances of Redis.
If you want to use multiple CPUs you can start thinking of some way to shard earlier.
2. Is it really alarming to have more than 100% CPU utilisation?
Yes, it is alarming to have high CPU utilization because it can result in connection errors or high latency.
CPU utilization is important, but so is whether the Redis instance can sustain your throughput at a given latency. You can check the Redis latency with the command redis-cli --latency while CPU usage is high.
3. Do we expect any noticeable performance issues?
This is really hard to say or predict because it depends on several factors (client service, commands run within a time frame, workload). Some of the most common causes of high latency and performance issues are:
Client VMs or services are overloaded and not consuming the messages from Redis: When a client opens a TCP connection to Redis, the Redis server keeps a buffer of messages to send over that connection. If a client service has its CPU maxed out, leaving the kernel no time to receive messages from Redis, those messages pile up on the Redis server.
The commands executed are consuming a lot of CPU: The following commands are known to be potentially very expensive to process:
EVAL/EVALSHA
KEYS
LRANGE
ZRANGE/ZREVRANGE
4. What are the options to tackle the performance issues (if any) caused by higher CPU utilisation (>=100%)?
This question revolves mainly around the scaling design of your implementation. Since Redis is single-threaded, a better approach to reducing CPU usage would be to shard your data across multiple Redis instances and put a proxy in front of them to distribute the load. Please take a look at the graph under the Twemproxy section of this link.
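The core of sharding is a stable mapping from key to instance. The sketch below shows that mapping with plain modulo hashing and in-memory dictionaries standing in for the Redis instances; it is not how Twemproxy itself is implemented (a proxy would typically use consistent hashing so that adding a shard moves fewer keys), and all names here are hypothetical.

```python
import zlib
from typing import Dict, List, Optional


def shard_for_key(key: str, shard_count: int) -> int:
    """Map a key to a shard index with a hash that is stable across processes."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    return zlib.crc32(key.encode("utf-8")) % shard_count


class ShardedStore:
    """Route each get/set to one of several backends (dicts stand in for Redis)."""

    def __init__(self, shard_count: int) -> None:
        self._shards: List[Dict[str, str]] = [{} for _ in range(shard_count)]

    def set(self, key: str, value: str) -> None:
        self._shards[shard_for_key(key, len(self._shards))][key] = value

    def get(self, key: str) -> Optional[str]:
        return self._shards[shard_for_key(key, len(self._shards))].get(key)

    def sizes(self) -> List[int]:
        """Number of keys held by each shard, to check how evenly load spreads."""
        return [len(shard) for shard in self._shards]


store = ShardedStore(shard_count=4)
for i in range(1000):
    store.set(f"session:{i}", "payload")
print(store.sizes())
```

Because each Redis process uses one core for command execution, spreading keys this way lets several cores share the work, as long as no single hot key dominates the traffic.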
5. Will switching to M5 capacity tier address the high CPU consumption and the corresponding issues?
Switching to a higher capacity tier should help with the latency temporarily, but this is vertical scaling, which is limited to the tiers that Cloud Memorystore offers.
Redis Enterprise solves all the issues you are facing. Redis Enterprise can be configured in a clustered configuration and utilize all the resources of the machine as well as scale out over multiple machines.
The Redis Enterprise Software is responsible for watching over the CPU utilization and other resource management tasks so you do not need to.
It is offered on GCP and GCP marketplace as well.
https://redis.com/redis-enterprise-cloud/pricing/
We have 2 Elastic VMs (Linux, currently DS2V2) behind an Azure Load Balancer. We are doing HTTP POSTs from our local LAN to the Load Balancer, but we seem to be getting throttled. We have tried changing the size of the VMs (no difference), adding additional premium SSDs (again no difference), and running multiple threads on our end (again no difference).
What we did do, though, was have the Elastic engine ingest all of the log files from the Linux boxes, and the index rate jumped pretty high while it was ingesting them. So we are assuming that it's not really the Linux Elastic boxes that are throttling us.
We do have Kibana installed on the boxes, and as a baseline, we're just using the "Cluster Indexing Rate" for both our local posts to the box and the local ingestion of the log files.
We do understand that yes, there is going to be some latency and overhead since we are now involving the internet, but not the rates we are currently getting. (We have a 1G pipe to the internet, it's nowhere near capacity, so we can rule out at least getting out of our company).
The question is, where else can we look to determine where we might be getting throttled?
"MUCH slower" performance is a somewhat subjective description, so the cause is hard to pinpoint. Here is some information about factors that may affect it.
Azure Compute requests may be throttled at a subscription and on a per-region basis. If you have an API throttling error, you could refer to this document to troubleshoot throttling issues, and best practices to avoid being throttled.
Some factors, such as CPU and storage limits that differ between Azure VM sizes, may affect how quickly the Azure VM can process incoming data. You may change to a size with more CPU and a premium SSD disk. You could also move your Azure resources to another region that is closer to your location. You could refer to this article.
I'm deploying my cluster on the Google Cloud Kubernetes service. It already has a few nodes. I also need a GPU server from Google Cloud to work with my cluster. The GPU instance continuously processes the incoming traffic (bandwidth should be up to 1 Gb/s) and sends the results to the cluster nodes (bandwidth should be even higher than the incoming bandwidth).
The most critical things for me in the project:
1) bandwidth between these nodes inside cluster;
2) bandwidth between the node and the GPU server;
3) bandwidth between the GPU server and the world;
4) bandwidth between the node and world.
The minimum acceptable bandwidth for each node is 1 Gb/s for both download and upload. When I run speed tests, they show a download speed of 100-680 Mb/s and an upload speed of 67-138 Mb/s for the same node at the same time (the screenshots below were taken 30 seconds apart). So the current bandwidth is too low and unstable, but I need stable bandwidth of at least 1 Gb/s.
I tried to find a technical specification or pricing for bandwidth in the Google Cloud documentation, but the technical specifications only cover CPU/GPU/RAM/disk, not bandwidth, and the docs only list pricing for traffic per month.
TL;DR:
How can I set stable 1 Gb/s or more bandwidth for each of the cluster nodes, GPU instance and any other Google Cloud virtual machine?
Is there any service in Google Cloud that provides bandwidth of more than 1 Gb/s?
Is there any solution/service in Google Cloud for handling heavy Internet traffic?
P.S. speed tests were made via:
npx speedo-cli
There's no guarantee really, especially when it comes to traffic to/from networks outside GCP. Here are a few things you can do to maximize bandwidth though:
Increase the number of CPU cores per instance:
caps are dependent on the number of vCPUs that a virtual machine instance has. Each core is subject to a 2 Gbits/second (Gbps) cap for peak performance. Each additional core increases the network cap, up to a theoretical maximum of 16 Gbps for each virtual machine. source
Note that the 2 Gbps per vCPU cap represents a theoretical limit using internal networks:
The cap is a limit that can't be exceeded and doesn't indicate the actual throughput of your egress traffic. There is no guarantee that your traffic will achieve the maximum throughput, which depends on many factors other than the cap. source
In case of traffic between VMs (i.e., cases 1 and 2 in your question) make sure the VMs are located in the same zone and you're using internal IPs:
Any time you transfer data or communicate between VMs, you can achieve max performance by always using the internal IP to communicate. In many cases, the difference in speed can be drastic. source
Check this answer on how to measure network bandwidth between VMs using iperf.
For advanced use cases you can try fine-tuning the TCP window size in your VMs.
Finally, one benchmark observed that the GCP network throughput is 81x more variable when compared to AWS. Naturally this just reflects one benchmark but you might find it worthwhile to test other providers yourself.
Since Aleksi's answer there have been some changes to the per-VM egress cap/throttle. It is still computed as 2 Gbit/s * NumberOfvCPUs, but the maximum is now 32 Gbit/s (when the VM is created with min_cpu_platform of skylake or better) and there is a minimum of 10 Gbit/s for VMs with 2 or more vCPUs.
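The rule described above can be written down as a small function. This is only a sketch of the numbers quoted in this thread (2 Gbit/s per vCPU, a 10 Gbit/s floor from 2 vCPUs, a 32 Gbit/s ceiling on Skylake or better); treating 16 Gbit/s as the ceiling on older platforms is an assumption carried over from the earlier answer, and the current limits should be checked against Google's documentation.

```python
def egress_cap_gbps(vcpus: int, skylake_or_better: bool = True) -> float:
    """Per-VM egress cap in Gbit/s, following the rule quoted in this thread."""
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    cap = 2.0 * vcpus                  # 2 Gbit/s per vCPU
    if vcpus >= 2:
        cap = max(cap, 10.0)           # floor for VMs with 2 or more vCPUs
    # Assumption: the older 16 Gbit/s maximum still applies below Skylake.
    ceiling = 32.0 if skylake_or_better else 16.0
    return min(cap, ceiling)


for n in (1, 2, 4, 8, 16, 32):
    print(n, "vCPUs ->", egress_cap_gbps(n), "Gbit/s cap")
```

Keep in mind that this is a cap, not a guarantee: the measured throughput can be well below it.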
It wasn't clear to me what the endpoints were for your speed test, but one of the (many) limits to the throughput of a TCP connection is:
Throughput <= WindowSize / RoundTripTime
One would expect the GPU instance and the node(s) would be located close to one another, but that limit may come into play for GPU instance and node to the world.
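To put numbers on the window/RTT limit above, here is a small sketch. The 65535-byte window and the 50 ms round-trip time are hypothetical values picked for illustration, not measurements from the question.

```python
def max_tcp_throughput_bps(window_bytes: float, rtt_seconds: float) -> float:
    """Upper bound on one TCP connection: Throughput <= WindowSize / RoundTripTime."""
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(target_bps: float, rtt_seconds: float) -> float:
    """Window size required to sustain target_bps over a path with this RTT."""
    return target_bps * rtt_seconds / 8


ASSUMED_WINDOW = 65535   # bytes; hypothetical window without window scaling
ASSUMED_RTT = 0.050      # seconds; hypothetical RTT to a distant test server

limit = max_tcp_throughput_bps(ASSUMED_WINDOW, ASSUMED_RTT)
print(f"single-connection limit: {limit / 1e6:.1f} Mbit/s")

needed = window_needed_bytes(1e9, ASSUMED_RTT)
print(f"window needed for 1 Gbit/s: {needed / 1e6:.2f} MB")
```

With values like these, a single connection to a far-away speed test server cannot reach 1 Gbit/s no matter how fast the VM's network is, which is one reason such tests can under-report the available bandwidth.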
Beyond that, understanding what was happening for the variable throughput likely calls for packet traces, definitely at the sending side, preferably at the receiving side as well. Just the first 96 bytes of each packet would be sufficient in this sort of case. It would be one of the things a support organization would request.
I fear that you can't get any bandwidth commitments on shared infrastructure. If you have (a lot of) cash, using sole-tenant nodes[1], with all the parts of your architecture on the same tenant, can help eliminate interference from other tenants.
But even in this case, there is no commitment on network bandwidth. And, for now, GPUs aren't supported in this solution.
1: https://cloud.google.com/compute/docs/nodes/
Our web application has 5 pages (Signin, Dashboard, Map, Devices, Notification)
We have done the load test for this application, and load test script does the following:
Signin and go to Dashboard page
Click Map
Click Devices
Click Notification
We have a basic free plan in AWS.
While performing the load test, we didn't get any errors up to about 100 users. Please see the image below. NetworkIn and CPUUtilization seemed normal, but NetworkOut showed 846K.
But when we reached around 114 users, we started getting errors on the Map page (highlighted in red). During that time, only NetworkOut seemed to be high. Please see the image below.
We want to know what the optimal value for NetworkOut is. If this number is high, is there any way to reduce it?
Please let me know if you need more information. Thanks in advance for your help.
You are using a t2.micro instance.
This instance type has CPU limitations: it is good for bursty workloads, but sustained loads will consume all the available CPU credits. Thus, it might perform poorly under sustained load over long periods.
The instance also has limited network bandwidth that might impact the throughput of the server. While all Amazon EC2 instances have limited allocations of bandwidth, the t2.micro and t2.nano have particularly low bandwidth allocations. You can see this when copying data to/from the instance and it might be impacting your workloads during testing.
The t2 family, especially at the low-end, is not a good choice for production workloads. It is great for workloads that are sometimes high, but not consistently high. It is also particularly low-cost, but please realise that there are trade-offs for such a low cost.
See:
Amazon EC2 T2 Instances – Amazon Web Services (AWS)
CPU Credits and Baseline Performance for Burstable Performance Instances - Amazon Elastic Compute Cloud
Unlimited Mode for Burstable Performance Instances - Amazon Elastic Compute Cloud
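A rough way to reason about CPU credits during a load test is sketched below. It relies on the definition that one CPU credit equals one vCPU at 100% for one minute; the t2.micro figures used (6 credits earned per hour, a balance of 144 credits) are my recollection of the AWS documentation and should be treated as assumptions to verify, and the function name is made up.

```python
from typing import Optional


def minutes_until_credits_exhausted(start_credits: float, earn_per_hour: float,
                                    vcpus: int, utilization: float) -> Optional[float]:
    """Minutes of sustained load before the credit balance reaches zero.

    Returns None when credits are earned at least as fast as they are spent.
    """
    spend_per_minute = vcpus * utilization   # one credit = one vCPU-minute at 100%
    earn_per_minute = earn_per_hour / 60.0
    net_drain = spend_per_minute - earn_per_minute
    if net_drain <= 0:
        return None
    return start_credits / net_drain


# Assumed t2.micro figures: 1 vCPU, 6 credits/hour, starting from 144 credits.
for load in (0.05, 0.5, 1.0):
    print(load, minutes_until_credits_exhausted(144, 6, 1, load))
```

Once the balance hits zero, the instance is held to its baseline, so a load test that runs longer than this estimate may show errors that have more to do with the instance type than with the application.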
That said, the network throughput showing on the graphs is a result of your application. While the t2 might be limiting the throughput, it is not responsible for the spike on the graph. For that, you will need to investigate the resources being used by the application(s) themselves.
NetworkOut simply refers to the volume of outgoing traffic from the instance. To reduce NetworkOut, you need to reduce the traffic this instance sends out. So you may need to see which of Click Map, Click Devices and Click Notification is sending traffic out of the instance. It may not be related only to the number of users, but to a combination of the number of users and the application module.