Client Connections Count on AWS EFS?

Based on the AWS documentation, ClientConnections is:
The number of client connections to a file system. When using a standard client, there is one connection per mounted Amazon EC2 instance.
Since we have around 10 T3 EC2 instances running, I would think that ClientConnections would return a maximum of 10.
However, on a normal day there are around 300 connections, and the max we've seen is 1,080 connections.
I have trouble understanding what exactly the ClientConnections count is.
I initially thought 1 EC2 instance = 1 connection (since it only mounts once), but this doesn't seem to be the case.
Then I thought it might be per read/write operation, but looking at the graph, reads actually dip (we don't have many writes on our website).
Any help appreciated! I believe I might be missing some core concepts, so please feel free to add them in.

The ClientConnections count refers to the number of IP addresses (EFS clients) connecting to an EFS mount target on the NFS port, e.g. port 2049.
Resource: https://aws.amazon.com/premiumsupport/knowledge-center/list-instances-connected-to-efs/
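If you want to see the raw numbers behind the graph, here is a minimal sketch that pulls the ClientConnections metric from CloudWatch with boto3; the file system ID is a placeholder, so substitute your own.

```python
# Pull the ClientConnections metric for an EFS file system.
# "fs-12345678" is a placeholder file system ID.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.get_metric_statistics(
    Namespace="AWS/EFS",
    MetricName="ClientConnections",
    Dimensions=[{"Name": "FileSystemId", "Value": "fs-12345678"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,          # 5-minute buckets
    Statistics=["Sum"],  # Sum of connections reported over each period
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```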

Related

AWS EC2 instance fails consistently at 30 seconds on long page load

I am running an ECS instance on EC2 with an Application Load Balancer, a Route 53 domain, and an RDS db. This is an internal business application that I have restricted IP access to.
I have run this app for 3 weeks with no issues. However, today the data that the web app ingests is abnormally large. This is not a mistake. Because of this, a webpage takes approximately 4 minutes to complete, which I verified it does on my local machine. However, running the same operation on AWS fails at precisely 30 seconds every time.
I have connected the app running on my local machine to my production RDS db and am able to download and upload the data with no issue, so there is no issue with the RDS db. In addition, this same functionality has worked previously and only failed today due to the large amount of data.
I spent hours with Amazon support on this issue, but we couldn't figure it out. I am assuming it is a setting in one of the AWS services I am using that has a TTL or timeout set to 30 seconds, but I couldn't find it in any of the services I am using:
route53
RDS
ECS
ECR
EC2
Load Balancer
Target Group
You have a backend instance timeout, likely in the web server config.
Right now your ELB has a timeout of 60 seconds, but your assets are failing at 30.
There are only a couple of components on AWS with hardcoded timeouts like that. I'm thinking (because this is the first time it's happened) you have one of the following:
Size limits in the upstream, or
Time limits on connection keep-alive
Look at your web server software (httpd/nginx). Nginx has something called "upstream.conf" where you can set upstream timeouts; I'm not sure if httpd does as well.
Resources:
https://serverfault.com/questions/414987/nginx-proxy-timeout-while-uploading-big-files
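To rule the load balancer in or out from code, here is a hedged boto3 sketch (the load balancer ARN is a placeholder) that reads the ALB idle timeout and raises it well past the 4-minute page load; if the request still dies at 30 seconds afterwards, the timeout lives in the backend web server.

```python
# Inspect and raise the ALB idle timeout. The ARN is a placeholder.
import boto3

elbv2 = boto3.client("elbv2")
lb_arn = "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123"

# Read the current idle timeout (the default is 60 seconds).
attrs = elbv2.describe_load_balancer_attributes(LoadBalancerArn=lb_arn)
for attr in attrs["Attributes"]:
    if attr["Key"] == "idle_timeout.timeout_seconds":
        print("current idle timeout:", attr["Value"])

# Raise it above the ~4-minute page load time.
elbv2.modify_load_balancer_attributes(
    LoadBalancerArn=lb_arn,
    Attributes=[{"Key": "idle_timeout.timeout_seconds", "Value": "300"}],
)
```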
From the NLB documentation, maybe relevant:
EC2 instances must respond to a new request within 30 seconds in order to establish a return path.
I don't actually know what a return path is, nor what a 'response' is in this context since NLB has no concept of requests or responses.
- https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#connection-idle-timeout
EDIT: Disregard, this must have to do with UDP NATing. 'Response' here is probably a packet going back from the EC2 instance to the client

Ping between aws and gcp

I have created a Site-to-Site VPN connection between a VPC on Google Cloud Platform and one on AWS, using the North Virginia region for both VPCs. But the problem is that I have been getting very high ping and low bandwidth while communicating between the instances. Can anyone tell me the reason for this?
[Image showing the ping data]
The ping is very high considering the regions are very close to each other. Please help.
There could be multiple reasons behind this:
1) Verify GCP network performance with gcping.
2) Verify the TCP window size and RTT, which bound bandwidth.
3) Verify throughput with iperf or tcpdump.
https://cloud.google.com/community/tutorials/network-throughput
Be aware that any VPN will be traversing the internet, so even though the regions are relatively close to each other, there will be multiple hops before the instances are connected together.
Remember that traffic from the instance needs to route out of AWS's network, over however many hops on the internet to GCP, and finally to the destination instance, and then back again to return the response.
In addition, there is some variation in performance, as the line is not dedicated.
If you want dedicated performance without traversing the internet, you would need to look at AWS Direct Connect. However, this might limit your project because of cost.
One of the many limits on TCP throughput is:
Throughput <= EffectiveWindowSize / RoundTripTime
If your goal is indeed higher throughput, then you can consider tweaking the TCP window size limits. The default TCP window size under Linux is ~3 MB. However, there is more to EffectiveWindowSize than that. There is also the congestion window, which will depend on factors such as packet losses and the congestion control heuristics being used (e.g. cubic vs. bbr).
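To make the bound concrete, here is a back-of-the-envelope sketch; the 15 ms RTT is an assumed figure for illustration, so measure your own with ping.

```python
# Upper bound on single-stream TCP throughput: window / RTT.
window_bytes = 3 * 1024 * 1024  # ~3 MB default Linux window (see above)
rtt_seconds = 0.015             # 15 ms round-trip time (assumed)

throughput_bps = (window_bytes * 8) / rtt_seconds
print(f"max throughput ~ {throughput_bps / 1e9:.2f} Gbit/s")  # ~1.68 Gbit/s
# Halve the window or double the RTT and the ceiling drops proportionally.
```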
As far as sanity checking the ping RTTs you are seeing, you can compare with ping times you see between an instance in AWS us-east-1 and GCP us-east4 when you are not using a VPN.

AWS - EC2 and RDS in different regions is very slow

I'm currently in Sydney and I do have the following scenario:
1 RDS instance in N. Virginia.
1 EC2 instance in Sydney.
1 EC2 instance in N. Virginia.
I need this for redundancy, and this is the simplified scenario.
When my app on the Sydney EC2 instance connects to RDS in N. Virginia, it takes almost 2.5 seconds to give me the result. We can think: OK, that's the latency.
BUT, when I send the request to the EC2 instance in N. Virginia, I get the result in less than 500 ms.
Why is there such a slow connection when you access RDS from outside its region?
I mean: I experience this slow connection when I'm running the application on my computer too. But when the application is in the same region as RDS, it works quicker than on my own computer.
Most likely your request to RDS requires multiple round trips to complete, i.e. first your EC2 instance requests something from RDS, then something else based on the first request, etc. Without seeing your database code, it's hard to say exactly what might be the cause.
You say that when you talk to the remote EC2 instance instead, you get the response in less than 500 ms. That suggests that setting up a TCP connection and sending a single request with a reply takes about 500 ms. Based on that, my guess is that your database conversation requires at least 5x back-and-forth traffic.
There is no additional penalty with RDS in terms of using it out of region, but most database protocols are not optimized for high latency conditions. You might be much better off setting up a read replica in Sydney.
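A rough sketch of that arithmetic; the ~220 ms Sydney to N. Virginia RTT and the round-trip count are assumptions for illustration, so substitute figures from ping and your own query log.

```python
# Estimate response time as round trips times RTT plus server work.
rtt_s = 0.220         # assumed Sydney <-> N. Virginia round-trip time
round_trips = 5       # e.g. connect + auth + several dependent queries
db_work_s = 0.100     # time the database spends actually working

total = round_trips * rtt_s + db_work_s
print(f"estimated response time: {total:.2f} s")  # ~1.20 s
# A same-region client pays sub-millisecond RTTs, so the same
# conversation completes in well under 500 ms.
```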
If you are connecting to RDS over the public internet, it can be slow. AWS has launched cross-region VPC peering; peer the two regions' VPCs (make sure there is no IP range conflict) and try to connect over the private connection.
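For reference, a minimal boto3 sketch of requesting such a cross-region peering connection; the VPC IDs are placeholders, and the two VPCs' CIDR ranges must not overlap.

```python
# Request a cross-region VPC peering connection (Sydney -> N. Virginia).
import boto3

ec2 = boto3.client("ec2", region_name="ap-southeast-2")  # requester side

peering = ec2.create_vpc_peering_connection(
    VpcId="vpc-11111111",      # placeholder: requester VPC in Sydney
    PeerVpcId="vpc-22222222",  # placeholder: accepter VPC in N. Virginia
    PeerRegion="us-east-1",
)
print(peering["VpcPeeringConnection"]["VpcPeeringConnectionId"])
# The peer side must still accept the request, and both sides need
# route table entries pointing at the peering connection.
```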

How to set up a Amazon EC2 instance local network to run a pktgen-dpdk experiment?

I want to run a DPDK experiment using the Amazon EC2 service, but there are a great number of services in AWS and I don't know which one to choose.
My experiment needs two servers connected by a 10 Gbps network adapter that supports DPDK. I will run pktgen-dpdk on one server to send packets toward the other server, where another DPDK application will process these packets.
I think I can rent servers such as c4.8xlarge or c4.4xlarge, but I don't know how to set up the local network between them. The local network should have low latency.
Any suggestions will be appreciated! Thank you!
You're looking for Virtual Private Cloud (VPC). An AWS EC2 "instance" like your c4.8xlarge is just a machine. The VPC and several other components allow you to set up a broader network, routing, security groups (basically, a firewall) and other networking capabilities, including, in your case, a gateway, which would allow your DPDK system to look out onto the Internet to find dependencies.
The in-network latency is extremely low, < 1ms in our experience. I think most current EC2 instances support 10Gbps networking and other speedy network capabilities.
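If it helps, here is a minimal boto3 sketch of the low-latency piece: launching both servers into a cluster placement group in one subnet. The group name, AMI, and subnet ID are placeholders.

```python
# Launch two DPDK servers close together in a cluster placement group.
import boto3

ec2 = boto3.client("ec2")

# The "cluster" strategy packs instances onto nearby hardware for
# low latency and high per-flow bandwidth.
ec2.create_placement_group(GroupName="dpdk-lab", Strategy="cluster")

ec2.run_instances(
    ImageId="ami-12345678",     # placeholder AMI with your DPDK build
    InstanceType="c4.8xlarge",  # 10 Gbps-class instance type
    MinCount=2,
    MaxCount=2,                 # pktgen sender + receiver
    SubnetId="subnet-12345678", # placeholder subnet in your VPC
    Placement={"GroupName": "dpdk-lab"},
)
```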

Is there a limit on outbound TCP connections through a EC2 NAT Instance?

Our setup is as follows:
VPC (with about 30-50 ec2 instances) -> EC2 Nat Instance -> Internet.
Since Dec 13, we have been seeing issues where connections randomly started being refused. No such issue was seen earlier. The only change is that the processing of URLs via the API has increased (in other words, more TCP connections are being initiated and worked on). API requests (POST/GET/PUT, it doesn't matter) from an EC2 instance within the VPC, via the NAT instance, to the Internet fail at random.
I enabled VPC Flow Logs, and in them I see entries showing ACCEPT OK for the TCP transmission (pic attached - https://ibb.co/dwe3X6). However, a tcpdump capture on one specific EC2 instance within the VPC shows the TCP retransmission failures (where traffic is going through the NAT instance) (pic attached - https://ibb.co/npqozm). They are from the same time and the same EC2 instance.
Basically, the SYN packet gets sent, but the actual handshake never completes. Note that this doesn't happen all the time.
The TCP retransmission failures are random; sometimes it works and sometimes it doesn't. This leads me to believe there is some sort of queue or buffer in the NAT instance that is hitting a limit, and I am not sure how to get to the root of this.
This suggests a problem out on the Internet or at the distant end.
ACCEPT means the connection (the instantiation of the flow) was allowed by the security group and network ACLs; it tells you nothing about whether the handshake succeeded. OK in the flow logs means the log entry itself is intact.
There is no reason to believe that the NAT instance is limiting this, because the SYN packets are indeed shown in Wireshark leaving the instance headed for the Internet, and the flow log suggests that they are indeed making their way out of the NAT instance successfully.
You used the term "refuse", but the Wireshark entries are consistent with a Connection timed out error rather than Connection refused. A refusal is an active rejection by the far end (or, less commonly, by an intermediate firewall) due to the lack of a listening service on the destination port, which causes the destination to respond with a TCP RST.
If you can duplicate the problem with a NAT Gateway, then you can be confident that it is not in any way related to the NAT instance itself, which is simply a Linux instance using iptables ... -j MASQUERADE.
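Here is a minimal boto3 sketch of that experiment: stand up a managed NAT Gateway and repoint the private route table at it. The subnet, Elastic IP allocation, and route table IDs are placeholders.

```python
# Swap the NAT instance for a NAT Gateway to isolate the problem.
import boto3

ec2 = boto3.client("ec2")

nat = ec2.create_nat_gateway(
    SubnetId="subnet-12345678",        # placeholder: a public subnet
    AllocationId="eipalloc-12345678",  # placeholder: an Elastic IP
)
nat_id = nat["NatGateway"]["NatGatewayId"]

# Wait for the gateway, then send the private subnets' default route
# through it instead of the NAT instance.
ec2.get_waiter("nat_gateway_available").wait(NatGatewayIds=[nat_id])
ec2.replace_route(
    RouteTableId="rtb-12345678",       # placeholder: private route table
    DestinationCidrBlock="0.0.0.0/0",
    NatGatewayId=nat_id,
)
# If the retransmission failures persist, the NAT instance is exonerated.
```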
The only thing the network infrastructure throttles is outbound connections to destination port 25, because of spam. Everything else is bounded only by the capabilities of the instance itself. With a t2.micro, you should have (iirc) in excess of 125 Mbits/sec of Ethernet bandwidth available, and the NAT capability is not particularly processor intensive, so unless you are exhausting the Ethernet bandwidth or the CPU credit balance of the instance, it seems unlikely that the NAT instance could be the cause.